Evaluating Agentic Optimization on Large Codebases


Authors: Atharva Sehgal, James Hou, Akanksha Sarkar

FORMULACODE: Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal 1*, James Hou 2*, Akanksha Sarkar 3, Ishaan Mantripragada 2, Swarat Chaudhuri 1, Jennifer J. Sun 3, Yisong Yue 2

Abstract

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FORMULACODE, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FORMULACODE comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic evaluation of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io.

1. Introduction

Large Language Models (LLMs) for code are rapidly evolving from isolated function-level synthesis to file-level editing, and now to repository-level optimization (Merrill et al., 2026; Jimenez et al., 2024; Zhang et al., 2025; Zhao et al., 2024; Shetty et al., 2025; Ma et al., 2025). These models are transitioning from assistants into autonomous coding agents, increasingly tasked with navigating complex, interconnected software ecosystems to diagnose bottlenecks and improve performance.
*Equal contribution. 1 The University of Texas at Austin, 2 California Institute of Technology, 3 Cornell University. Correspondence to: Atharva Sehgal. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

However, we currently lack frameworks to study these emerging capabilities across the full optimization lifecycle: for example, how agents balance multiple workloads, maintain functional integrity, and structure improvements at different levels of the codebase hierarchy. While there exist coding benchmarks based on real GitHub repositories (Jimenez et al., 2024; Zhang et al., 2025; Zhao et al., 2024), they generally do not capture the multi-workload, real-world tasks that engineers and researchers face in practice. These benchmarks often rely on binary pass/fail feedback, which is insufficient for measuring optimization, or on synthetic (e.g., LLM-generated) tasks, which lack the complexity and characteristics of real-world coding. Real-world optimization is rarely isolated: diagnosing and improving performance often requires reasoning about architectural decisions, component interactions, and design trade-offs at the system level rather than tuning an isolated function (Balsamo et al., 2004; Woodside et al., 2007; Jin et al., 2012). Consequently, we need a new evaluation standard capable of measuring the emerging ability of agents across this entire optimization workflow under realistic software engineering constraints.
We identify several directions for improving agentic coding benchmarks: (1) Fine-grained metrics: evaluation must move beyond binary correctness to capture continuous performance changes and trade-offs; (2) Real-world measurements: metrics should be derived from established execution environments (e.g., standard profiling suites) rather than synthetic proxies; (3) Reliable baselines: agent performance must be assessed against human optimization to provide a meaningful standard; and (4) Repository scale: agents must operate within large, evolving codebases.

We introduce FORMULACODE [1], a benchmark designed to advance agentic optimization on large, evolving software ecosystems. FORMULACODE is constructed from 957 real-world performance bottlenecks mined from 70 scientific, open-source Python repositories, like Pandas,

[1] FORMULACODE draws inspiration from Formula 1, where constructors must optimize entire systems, not just individual components, to achieve peak performance on the track. Similarly, FORMULACODE challenges code agents to perform holistic, codebase-level optimizations, reflecting the complexity and interdependence found in real-world software.

[Figure 1 (schematic): a FORMULACODE task built from Astropy performance issue #13479, "Performance of Angle, Latitude and Longitude is a major bottleneck in coordinate transforms." The panel shows the codebase (astropy/coordinates/ with core.py and angles.py, tests/, benchmarks/), the expert's generated PR [+4 -5], crowdsourced metrics such as time_init_scalar and time_init_array (measured in time[ns], with +57 more metrics), snapshot and unit tests, and the coding agent's multi-workload optimization results (advantage over human, #tests passed, small-edit vs. large-edit performance over time).]

Figure 1: FORMULACODE is a continuously updating benchmark for evaluating the holistic ability of agents to optimize large codebases.
Each task in FORMULACODE comprises a problem description of a performance regression from GitHub, an environment containing a baseline repository snapshot, and multiple expert-written, crowdsourced performance workloads, along with the tools to execute them. An agent's performance-improving edits are assessed by their ability to outperform expert-written edits in optimizing multiple workloads while meeting multiple forms of correctness guarantees.

Scikit-learn, and SciPy. Unlike previous datasets, each task in FORMULACODE is paired with an average of 264.6 community-maintained performance workloads alongside expert-authored patches. This construction enables the use of the airspeed-velocity (asv) framework to assess the full lifecycle of optimization (triage, diagnosis, and resolution) in a way that isolated coding tasks cannot.

We conduct a large-scale evaluation of frontier and open-weights models (GPT-5, Claude 4.0 Sonnet, Gemini 2.5 Pro, Qwen 3 Coder) within multiple agentic frameworks (Terminus 2, OpenHands). Our main findings are:

• Agents generally can improve runtime performance, but perform worse than human experts (§3.1).

• Agents are better at local or function-level optimization than at repository-level optimization (§3.2).

• Agents excel at certain optimization strategies (e.g., parallelizing or batching) and struggle with others (e.g., vectorized operations) (§3.3).

• Agent performance relative to experts can vary dramatically with the popularity of the repository, performing worst on the 4th quintile and best on the 2nd quintile (§3.4).

• Despite being more expensive per call, agents using frontier LLMs are overall more cost-effective than those using open-weights models (e.g., because open-weights models produce much longer reasoning chains) (§3.5.1).

• Compared to human experts, agents make less favorable performance–cost trade-off decisions (§3.5.2).
• We observe minimal effects from data leakage (i.e., from LLMs potentially trained on expert solutions) (§3.5.3).

We open-source FORMULACODE as a community resource [2], not only to measure what code agents can generate, but to understand how they can reliably optimize and maintain complex real-world systems.

2. FORMULACODE Benchmark Design

Each FORMULACODE task evaluates the ability of an agent to optimize a real-world codebase under strict correctness constraints. A task begins with a baseline repository, denoted Code_0, which represents the unmodified implementation. The agent operates on Code_0 and produces a modified version of the repository, denoted Code_agent, by making arbitrary repository-level edits. Each task is paired with two forms of evaluation signals:

• Correctness. Correctness is measured via a suite of tests on functional behavior. A proposed code modification is considered valid only if Code_agent passes all tests that Code_0 passes.

• Performance Workloads. Each task includes a large collection of expert-written performance workloads that exercise known performance-critical execution paths in the codebase. Each workload measures a single performance dimension, such as runtime or memory usage, and may exhibit natural variability due to execution noise.

[2] Project website at https://formula-code.github.io/.

Figure 1 depicts our benchmark setup. The top half shows a task from the Astropy repository, highlighting a performance issue with three functions: Angle, Latitude, and Longitude. There are 59 workloads defined by community-sourced, expert-written metrics. The goal of the coding agent is to modify the repository to optimize these workloads while maintaining correctness.

Performance evaluation proceeds by executing the full set of workloads on both Code_0 and Code_agent and comparing their measured outcomes.
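This evaluation gate can be sketched as follows. This is an illustrative harness only: the function and parameter names (`evaluate`, `baseline_tests_passed`, etc.) are ours, not FORMULACODE's actual API.

```python
# Illustrative sketch of the FormulaCode evaluation gate (hypothetical API).
# A modification is valid only if it passes every test the baseline passes;
# otherwise its edits are treated as reverted (per-workload speedup = 1.0).

def evaluate(baseline_tests_passed, agent_tests_passed,
             baseline_times, agent_times):
    """Return per-workload speedups, reverting to 1.0 on correctness failure.

    baseline_tests_passed / agent_tests_passed: sets of test identifiers.
    baseline_times / agent_times: dicts mapping workload name -> measured runtime.
    """
    # Correctness gate: the agent must pass all tests the baseline passes.
    if not baseline_tests_passed <= agent_tests_passed:
        return {w: 1.0 for w in baseline_times}  # modifications reverted
    # speedup_i = workload_i(Code_0) / workload_i(Code_agent); > 1 improves.
    return {w: baseline_times[w] / agent_times[w] for w in baseline_times}
```
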
Improving performance on one workload may degrade performance on others (Balsamo et al., 2004; Woodside et al., 2007; Jin et al., 2012). As a result, optimization in FORMULACODE is inherently multi-objective: agents must reason about trade-offs across subsystems and deliver improvements that are broad and consistent rather than localized to a single execution path.

2.1. Metrics

Speedup. For each workload i, we compare the performance ratio of Code_agent versus Code_0:

    speedup_i = workload_i(Code_0) / workload_i(Code_agent).

A speedup_i > 1 indicates an improvement. These ratios are dimensionless and allow performance changes to be compared across heterogeneous workloads. If Code_agent does not pass the correctness tests for workload i, then speedup_i = 1 (i.e., the modifications are reverted). For n workloads, the overall speedup is the geometric mean:

    speedup_agent = ( ∏_i speedup_i )^(1/n).    (1)

Advantage. For each task, we also have expert-written code modifications, Code_expert. For example, the performance issue in Figure 1 was eventually resolved by a human expert. We use the performance of Code_expert as a reference point to characterize the difficulty of each task. We then define the advantage of an agent as:

    Adv_agent = speedup_agent − speedup_expert.

If an agent had simply memorized the expert solution (e.g., due to training-data contamination), its advantage would be zero. Accordingly, the goal of super-human optimization is to achieve a large positive advantage. Appendix Figure 23 provides a geometric intuition for this metric.

Stratified Advantage. We now turn to measuring advantage aggregated at different levels of granularity. We use ℓ ∈ {0, 1, ...} to denote the code hierarchy level.

• At the coarsest level (ℓ = 0), we group workloads by entire modules, such as algorithms.*.
• At finer levels, we group workloads under individual classes or functions (e.g., algorithms.Sorting.*, algorithms.Sorting.time_sort_int.*).

Each level ℓ thus partitions the workloads into groups G^(ℓ) = {g^(ℓ)_1, ..., g^(ℓ)_{K_ℓ}}, where each workload belongs to some g^(ℓ)_k. We then define the per-group advantage as:

    Adv_{agent,g} = speedup_agent(g) − speedup_expert(g),

where speedup_*(g) is computed using Equation 1 over only the workloads in g. The stratified advantage at level ℓ is the average across all groups at that level:

    Adv^(ℓ)_agent = (1 / |G^(ℓ)|) Σ_{g ∈ G^(ℓ)} Adv_{agent,g}.

The family {Adv^(ℓ)_agent | ℓ ∈ Z_{≥0}} thus forms a multi-scale profile of an agent's performance. Because aggregation is performed over multiplicative speedup ratios within each group, Adv^(ℓ)_agent remains in the same metric family as the global advantage, but is sensitive to how performance gains are organized across the codebase hierarchy (Figure 22).

Normalized Advantage. Finally, we introduce a normalized version of advantage that explicitly accounts for noise and heterogeneity across workloads. Given the variance of the per-workload speedup ratios for an agent, σ²(agent), we define the normalized advantage of an agent as:

    Ãdv_agent = Adv_agent / sqrt( σ²(agent) + σ²(expert) ).

Conceptually, Ãdv_agent captures a signal-to-noise ratio of the agent's advantage, and rewards consistency across workloads.

Cost-Weighted Metrics. In practice, we also care about the inference budget of the optimization agent. We estimate the total inference cost as

    cost_agent = c_in · N^in_agent + c_out · N^out_agent,

where N^in_agent and N^out_agent denote the total numbers of input and output tokens, and c_in and c_out are the per-token prices. This allows us to define the cost-weighted advantage:

    cost(Adv_agent) = Adv_agent / cost_agent,

which captures the human-relative improvement obtained per unit of inference budget.
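For concreteness, the metrics above can be computed from per-workload speedup ratios as follows. This is a minimal sketch: grouping workloads by dot-separated name prefixes mirrors asv-style workload names such as algorithms.Sorting.time_sort_int, and all function names are ours, not the benchmark's.

```python
import math
from collections import defaultdict

def geo_mean(ratios):
    """Geometric mean of speedup ratios (Equation 1)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def advantage(agent_speedups, expert_speedups):
    """Adv_agent = speedup_agent - speedup_expert over the same workloads."""
    return (geo_mean(list(agent_speedups.values()))
            - geo_mean(list(expert_speedups.values())))

def stratified_advantage(agent_speedups, expert_speedups, level):
    """Average per-group advantage, grouping workloads by the first
    `level` dot-separated components of their name (module/class/function)."""
    groups = defaultdict(list)
    for name in agent_speedups:
        groups[".".join(name.split(".")[:level])].append(name)
    per_group = [
        geo_mean([agent_speedups[n] for n in names])
        - geo_mean([expert_speedups[n] for n in names])
        for names in groups.values()
    ]
    return sum(per_group) / len(per_group)

def normalized_advantage(agent_speedups, expert_speedups):
    """Adv divided by pooled per-workload noise (a signal-to-noise view).
    Undefined (division by zero) when both agents have zero variance."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    a = list(agent_speedups.values())
    e = list(expert_speedups.values())
    return advantage(agent_speedups, expert_speedups) / math.sqrt(var(a) + var(e))
```
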
We will use these metrics in §3 to evaluate code-optimization agents' performance on real-world codebases.

2.2. Dataset Construction

Here we briefly summarize our dataset construction; full details can be found in Appendix A. Figure 2 shows an overview of our procedure.

[Figure 2 (flowchart): 766 GitHub repositories (at least 100 stars) → rule-based filters (merged, references at least 1 issue, replicable dependencies) over 26,717 PRs from 101 repositories → LLM-based filters (intent to improve performance?) → environment synthesis via a feedback-driven LLM agent (successful build, performance suite & test suite), yielding 1,232 candidates from 75 repositories → statistical validation (Mann-Whitney U test to confirm oracle speedup), yielding 957 tasks from 70 repositories.]

Figure 2: Overview of the FORMULACODE construction pipeline. FORMULACODE follows a four-stage pipeline to identify real-world performance-optimization tasks: (1) scrape compliant repositories (§A.1.1); (2) apply rule-based and LLM-based filters to identify candidate performance-improvement pull requests (§A.1.2); (3) construct reproducible Docker environments for each candidate (§A.1.3); (4) validate each candidate for correctness and statistically significant performance improvement (§A.1.4). The pipeline is fully automated and updates FORMULACODE with new tasks every month.

Repository Scraping. We search for repositories with mature performance-benchmarking infrastructure. Using a CommonSQL script on GitHub's public dataset, we find 766 repositories containing Airspeed Velocity (ASV; Droettboom et al., 2025) performance workloads, ensuring they have active maintenance, Python 3.8+ support, and at least 100 stars (§A.1.1).

Attribute Filtering.
We scrape 26,717 pull requests from 127 repositories and apply both rule-based filters (merged status, benchmark-infrastructure presence, appropriate file changes) and LLM-based intent classification to identify 3,181 candidate performance improvements from 101 repositories. An LLM agent analyzes PR descriptions, patches, and linked issues to verify that the primary intent is performance optimization. For each candidate, the submitted patch corresponds to Code_expert (Appendix A.1.2).

Environment Synthesis. For each candidate, we automatically generate reproducible Docker build scripts using a reflexive LLM agent that iteratively refines installation commands based on build failures. Through chronological caching of successful scripts and targeted tool use, we synthesize verified environments for 1,232 tasks across 75 repositories (Appendix A.1.3).

Statistical Validation. We execute expert patches and baseline code in isolated environments, measuring performance across all ASV workloads. Using Mann-Whitney U tests (p < 0.002; Mann & Whitney, 1947) and strict correctness checks (unit tests + snapshot tests), we retain only tasks with statistically significant, reproducible improvements, yielding 957 final tasks across 70 repositories (§A.1.4). This pipeline is projected to add an average of 27.00 new tasks per month.

3. Experiments

We organize our experimental findings into three categories.

• First, we present overall performance metrics to investigate whether agents can achieve meaningful runtime speedups and whether they can outperform experts.

• Second, we provide a detailed breakdown of agent capabilities, examining performance across optimization strategies, optimization scope, and repository popularity.

• Third, we present additional findings on cost-effectiveness, multi-workload optimization, data leakage, and ensemble approaches.

We compare four frontier LLMs – GPT-5 (Singh et al.
, 2025), Claude 4.0 Sonnet (Anthropic, 2025), Gemini 2.5 Pro (Comanici et al., 2025), and Qwen 3 Coder (Yang et al., 2025) – under two agent frameworks – Terminus 2 (Merrill et al., 2026) and OpenHands (Wang et al., 2025). Terminus 2 is evaluated with all four models, while OpenHands is evaluated with GPT-5, Claude 4.0 Sonnet, and Qwen 3 Coder. Additional discussion of model and framework choices appears in Appendix §B.2.2. Evaluations are conducted on FORMULACODE-V due to compute constraints, using the metrics defined in §2. Full experimental details and additional analyses are provided in Appendix §B.2.

3.1. Global Leaderboard

For each agent–model configuration, we compute the human-relative advantage Adv and normalized advantage Ãdv defined in §2. We then aggregate configurations into a global leaderboard using the Ranked Pairs (RP) method (Tideman, 1987), yielding a transitive ordering. Table 1 summarizes the resulting rankings.

Table 1: Global leaderboard of agent-model configurations on FORMULACODE-V. We report the Ranked Pairs (RP) position induced by human-relative advantage (Adv), the normalized advantage (Ãdv), and the speedup, as defined in §2.

Agent         Model              RP Rank (Adv)↓  Adv↑     Ãdv↑     speedup↑
Terminus 2    GPT-5              7               -0.0504  -0.1387  1.0585
              Claude 4.0 Sonnet  4               -0.0410  -0.1065  1.0987
              Gemini 2.5 Pro     6               -0.0433  -0.1138  1.0963
              Qwen 3 Coder       5               -0.0454  -0.1257  1.0677
OpenHands     GPT-5              3               -0.0209  -0.0702  1.0825
              Claude 4.0 Sonnet  1               -0.0112  -0.0483  1.0539
              Qwen 3 Coder       2               -0.0301  -0.1529  1.0346
Human Expert  -                  -               0.0000   0.0000   1.1040

[Figure 3: line plot of stratified advantage (y-axis, −0.1 to 0.3) at the module, class, and function levels for each agent–model configuration (Claude 4.0 Sonnet, GPT-5, Gemini 2.5 Pro, Qwen 3 Coder; OpenHands and Terminus 2).]

Figure 3: Stratified advantage across hierarchy levels for each agent–model configuration.
Each line traces the stratified advantage (Adv^(ℓ)_agent) over ℓ ∈ {1, 2, 3}, revealing whether a configuration prefers coarse module-level changes or fine-grained function-level edits.

Observation: Agents achieve non-trivial speedups over the baseline. All evaluated configurations attain speedup > 1 on FORMULACODE-V relative to the baseline codebase (associated with the issue), indicating that agents can successfully identify and implement runtime-relevant changes.

Observation: Agents underperform human experts on performance-optimization tasks. For all agents, the overall advantage, Adv, is negative, indicating a fundamental performance gap. We also notice a disagreement between the Adv and speedup metrics for many configurations: large performance gains on certain "easier" tasks have a disproportionate influence on the global speedup score. The influence of such tasks is diminished in the Adv score, which compares each agent improvement to the corresponding expert improvement; since tasks that are "easier" typically also admit larger expert speedups, this relative metric yields a more consistent difficulty reference.

3.2. Large-Scale vs. Small-Scale Refactors

To disentangle performance by optimization scale, we use the hierarchical structure of FORMULACODE-V workloads (Figure 22) and the stratified advantage Adv^(ℓ)_agent from §2. We construct per-configuration profiles across three strata: module-level aggregation (ℓ = 1), class-level aggregation (ℓ = 2), and function-level aggregation (ℓ = 3). For each configuration and level ℓ, we compute group-level speedups and advantages, shown in Figure 3.

Observation: Agents demonstrate characteristic performance profiles. In Figure 3, models exhibit diverse performance profiles.
OpenHands + Claude 4.0 Sonnet performs best at module-level optimization but underperforms at the function level, indicating that this configuration can overlook small-scale optimizations in favor of large-scale ones. Conversely, OpenHands + GPT-5 performs best at the function level but loses effectiveness at the module level.

Observation: Agents are comparatively stronger on local optimizations. With few exceptions (notably Claude 4.0 Sonnet + OpenHands), configurations achieve higher stratified advantage at function-level aggregation.

3.3. Type of Optimization Problem

We investigate whether models can outperform human experts on particular classes of optimizations. For each problem in FORMULACODE-V, we label the optimization attempted by the human-written patch using an LLM (see §B.1.7 for details). Next, we aggregate the advantage of each agent–model pair within each optimization class. Table 2 summarizes the results.

Observation: Some optimization classes remain systematically difficult for agents. We observe certain optimization categories where agents outperform experts. Specifically, all agents were able to find faster solutions on tasks where the expert attempted a parallelization- or batching-based solution. Conversely, all agents struggle when the human solutions require delegating to lower-level system implementations (C extensions, vectorized operations).

3.4. Long-Tail Generalization Across Repository Popularity

We next study how performance varies by repository popularity (measured using GitHub stars). We compute advantage statistics for each popularity quintile.

Observation: Agents perform weakest on tail repositories. Agent performance is substantially lower in the first popularity quintile (Q1; bottom 20%), which comprises

Table 2: Per-tag advantage for each agent–model configuration.
Columns correspond to optimization tags (see §B.1.7), and cells report the human-relative advantage restricted to workloads whose patches are annotated with the respective tag. OpenHands + GPT-5 shows strong advantage on algorithmic rewrites and data-structure changes, while other models perform comparatively better on micro-optimizations or caching.

Agent       Model              Algo    Data    Lower   Approx  Parallel  Reduce  Cache   Batch  Scale  DB  Micro   I/O  Higher  Uncat
Terminus 2  GPT-5              -0.064  -0.112  -0.233  –       0.010     -0.006  -0.054  0.028  –      –   0.001   –    -0.002  –
            Claude 4.0 Sonnet  -0.019  0.011   -0.720  –       0.013     -0.028  -0.048  0.041  –      –   -0.038  –    -0.009  –
            Gemini 2.5 Pro     -0.029  0.011   -0.676  –       0.013     -0.028  -0.048  0.041  –      –   -0.038  –    -0.007  –
            Qwen 3 Coder       -0.023  0.007   -0.455  –       0.007     -0.079  -0.027  0.042  –      –   -0.066  –    0.005   –
OpenHands   GPT-5              0.015   -0.052  -0.211  –       0.015     -0.051  -0.018  0.040  –      –   -0.018  –    -0.008  –
            Claude 4.0 Sonnet  -0.028  0.023   -0.180  –       0.007     -0.049  -0.017  0.047  –      –   0.086   –    -0.005  –
            Qwen 3 Coder       -0.020  -0.004  -0.203  –       0.012     -0.016  -0.019  0.051  –      –   -0.063  –    0.013   –

Table 3: Performance across repository popularity quintiles (by GitHub stars). We report Adv_agent for workloads drawn from repositories in each quintile, from least popular (Q1) to most popular (Q5). Red signifies worse performance.

Agent       Model              Q1       Q2       Q3       Q4       Q5
Terminus 2  GPT-5              -0.0194  0.0423   -0.0045  -0.2754  -0.0123
            Claude 4.0 Sonnet  -0.0450  -0.0062  0.0025   -0.3529  -0.0220
            Gemini 2.5 Pro     0.0077   -0.0062  0.0024   -0.3311  -0.0445
            Qwen 3 Coder       -0.0691  0.0052   -0.0179  -0.1669  -0.0332
OpenHands   GPT-5              -0.0387  0.0315   0.0072   -0.0769  -0.0068
            Claude 4.0 Sonnet  -0.1041  0.0291   -0.0200  -0.0378  0.0263
            Qwen 3 Coder       -0.0159  0.0137   0.0227   -0.0878  -0.0270

repositories with 133–202 GitHub stars. Expert patches, however, yield comparatively large gains in this regime: speedup_expert(Q1) = 1.1104, the second-largest speedup across quintiles.
One hypothesis is that smaller repositories still contain heterogeneous, high-impact micro-optimizations of a kind that have already been discovered in larger, more mature repositories, leading to more variable (but sometimes high-impact) optimization opportunities. A second plausible hypothesis is distribution shift: smaller repositories may be less represented in training corpora, reducing agent effectiveness.

Observation: Agents are most competitive on mid-popularity repositories. In the 20th to 60th percentile range, mean advantages are closest to expert performance, and some configurations perform comparably with experts. We hypothesize two reasons. First, moderately popular repositories more closely match the agents' training distribution than tail repositories do. Second, these repositories have more unexploited optimization avenues relative to highly popular projects.

Observation: Performance dips in high-popularity repositories. Agent performance is lowest in the fourth quintile (Q4; 6,371–10,343 stars). In this regime, expert patches also yield the smallest gains: speedup_expert(Q4) = 1.0822, the lowest expert speedup across all quintiles. This pattern indicates reduced remaining optimization headroom in these repositories, where many simpler improvements may already have been realized. Additionally, slight distribution shift may persist and limit agent effectiveness.

[Figure 4: scatter plot of mean advantage (y-axis, −0.05 to −0.01) versus mean cost in USD (x-axis, 1.0 to 3.5) for each agent–model configuration (OpenHands and Terminus 2 with Claude 4.0 Sonnet, Qwen 3 Coder, Gemini 2.5 Pro, GPT-5).]

Figure 4: Cost–performance tradeoff of agent-model configurations. As most agents struggle on code-optimization tasks, the Pareto set is primarily dominated by the most expensive model (Claude 4.0 Sonnet).

3.5. Practical Considerations

3.5.1. Cost Efficiency.
Frontier models differ substantially in end-to-end inference cost due to provider pricing and the number of tokens consumed by a given agent configuration. In this experiment, we examine the cost–performance tradeoff within our agent configurations using the cost-weighted objectives defined in §2. Table 10 reports a leaderboard based on cost-weighted normalized advantage, and Figure 4 summarizes the resulting trade-off.

Observation: Higher-priced models rank best under the cost-weighted objective. When weighted by cost, top-ranked configurations tend to use the higher-priced (and more capable) models. A contributing factor is that lower-capability models often consume more tokens within the agent loop, which can offset lower per-token prices. This may also indicate that smaller models lack the capabilities to reason effectively about performance optimizations.

[Figure 5: scatter plot of worst-workload speedup (y-axis, 0.90 to 1.00) versus global speedup (x-axis, 1.00 to 1.10) for each agent–model configuration (OpenHands and Terminus 2 with Claude 4.0 Sonnet, Qwen 3 Coder, Gemini 2.5 Pro, GPT-5) and the expert.]

Figure 5: Multi-workload tradeoff performance of agent-model configurations. We quantify a model's speedup performance as a function of its worst regression. The expert patch achieves the highest speedup while negotiating considerably high workload regressions.

3.5.2. Multi-Workload Tradeoff Performance.

Performance optimization necessitates a holistic understanding of competing workloads. In this experiment, we compare the global speedup achieved by a model with the largest regression it causes. For each agent–model configuration, we compute (i) the global speedup aggregated across tasks and workloads, and (ii) the average worst-workload speedup, defined as follows: for each task, we take the minimum speedup across the task's workloads, and then average this minimum across tasks.
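The two aggregates can be sketched as follows. This is illustrative only: representing each task as a dict of per-workload speedup ratios is our assumption, not the benchmark's data format.

```python
import math

def global_speedup(tasks):
    """Geometric mean of speedups over all workloads of all tasks.

    tasks: dict mapping task name -> {workload name: speedup ratio}.
    """
    ratios = [r for workloads in tasks.values() for r in workloads.values()]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def mean_worst_workload_speedup(tasks):
    """For each task, take the minimum (worst-regression) workload speedup,
    then average these minima across tasks."""
    worst = [min(workloads.values()) for workloads in tasks.values()]
    return sum(worst) / len(worst)
```

A configuration that optimizes broadly scores well on both axes; one that speeds up a few paths while regressing others shows a high global speedup but a low worst-workload speedup.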
Figure 5 plots these two quantities.

Observation: Multi-workload optimization remains challenging for agents. Despite causing large regressions, human code edits achieve the best global speedup, indicating a superior ability to negotiate multi-workload performance tradeoffs compared to our configurations.

3.5.3. Temporal Generalization.

Motivation. FORMULACODE is a live benchmark: tasks are continuously added and include creation timestamps. This enables us to probe the temporal out-of-distribution behavior of agents on performance-optimization tasks. Related work on code correctness finds large gains when tasks are present in training corpora (Jain et al., 2024a). We bucket tasks by their month of creation and compute the mean global speedup in windows defined by the temporal distance to each model's knowledge cutoff (§B.2.2). We use 3-month bins and consider bins up to 6 months before/after the cutoff. Table 4 summarizes the results.

Observation: Limited evidence of a cutoff-aligned leakage effect. Performance shows no consistent shift when moving from pre-cutoff to post-cutoff task-creation dates, suggesting the gap is capability-based rather than data-based.

Table 4: Temporal analysis of model performance across knowledge-cutoff boundaries. Each column represents a temporal bin defined by distance (in months) from the model's training-data cutoff; values indicate the mean global speedup (speedup_agent) within each bin. We find no consistent drop in performance.

                   Before Cutoff                After Cutoff
Model              6+ mo   3-6 mo  0-3 mo       0-3 mo  3-6 mo  6+ mo
Claude 4.0 Sonnet  1.0892  1.0564  0.9966       1.0915  1.0951  1.0519
GPT-5              1.1708  1.0454  0.9871       1.0378  1.0679  1.0500
Gemini 2.5 Pro     1.1071  0.9989  1.0219       1.0523  1.1063  1.0251

4. Related Work

Algorithms for Code Optimization. There is a long history of research on iterative code optimization using execution feedback.
Classical approaches to this problem were based on stochastic search and constraint solving (Schkufza et al., 2013; Sasnauskas et al., 2018). Among deep-learning-based approaches, AlphaTensor and AlphaDev produce super-optimized matrix-multiplication and sorting routines, respectively (Fawzi et al., 2022; Mankowitz et al., 2023). These systems combine large, publicly sourced pretraining datasets with carefully chosen inductive biases to make optimization faster. More general agentic optimization workflows operate by iteratively running LLM-generated code, evaluating the output, and feeding the output back to the model. Terminus 2 and OpenHands represent two such configurations, among many, that benefit from iterative feedback (Yao, 2024; Yang et al., 2024; Merrill et al., 2026; Wang et al., 2025; Merrill & Shaw, 2025). FORMULACODE is the first benchmark purpose-built to assess the multi-workload optimization ability of such agentic AI algorithms in real-world codebases, and it provides the fine-grained evaluation functions needed for iterative optimization.

Evolutionary optimization algorithms equipped with LLMs (Romera-Paredes et al., 2024; Grayeli et al., 2024) iteratively improve a candidate pool of programs using execution feedback. Systems like AlphaEvolve (Novikov et al., 2025) and OpenEvolve (Sharma, 2025) demonstrate that such agents can efficiently discover and refine novel, high-performance code-based heuristics across diverse scientific domains. These methods are scalable but require high-quality evaluation functions to penalize degenerate solutions. While FORMULACODE provides the necessary evaluation functions, we could not benchmark evolutionary methods due to their substantial compute needs.

Code Generation Benchmarks. Coding benchmarks can be differentiated by their synthesis scope. For a list of differences, consult Table 5.

Function and file level.
HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) present hand-written programming problems in Python with corresponding unit tests. Many contributions extend these benchmarks to have more testing (Liu et al., 2023), broader scope (Yin et al., 2022; Yang et al., 2023), and more task diversity (Muennighoff et al., 2023; Lai et al., 2022; Zan et al., 2022). CruxEval (Gu et al., 2024) benchmarks the code execution and reasoning ability of LLMs more deeply. LiveCodeBench (Jain et al., 2024a) attempts to mitigate data leakage by annotating problems with release dates. All these benchmarking efforts utilize unit testing suites to gauge program correctness. FORMULACODE supplements the evaluation signal provided by the above datasets by using community-maintained evaluation functions that continually update with each commit.

Table 5: Comparing FORMULACODE with related codebase benchmarks. FORMULACODE is the only benchmark that satisfies the desired properties for evaluating LLM agents on real-world code optimization tasks. ++ denotes continually updating benchmarks. Data is sampled from real distributions such as GitHub, Leetcode, AtCoder, and Codeforces, or from LLM-generated/synthetic distributions (indicated by source icons in the original table). An extended analysis is presented in §4.

Benchmark       Evaluation framework       # Tasks   # Workloads/Task   Live updates   Search space   Synthesis scope   Leakage resistant?
GSO-Bench       Performance                102       Single             ✗              Large          Repo              ✗
SWE-Bench       Unit Tests                 2292      -                  ✗              Small          Repo              ✗
LiveCodeBench   Unit Tests                 300++     -                  ✓              Small          File              ✓
SWEfficiency    Performance & Unit Tests   400       Single             ✗              Large          Repo              ✗
CruxEval        Unit Tests                 800++     -                  ✗              Small          File              ✓
FormulaCode     Performance & Unit Tests   957++     264.58             ✓              Large          Repo              ✓

Repository level. Function- and file-level benchmarks evaluate coding ability on self-contained coding tasks.
However, real software issues typically span multiple modules and files. Repository-level benchmarks (Jimenez et al., 2024; Tang et al., 2024; Jain et al., 2024b; Shetty et al., 2025) aim to preserve the inherent challenges of real-world software engineering beyond text completion, such as finding relevant files, capturing relationships between modules, and tracing information flow. SWE-Bench (Jimenez et al., 2024) collects GitHub issues from popular repositories and evaluates coding agents' ability to resolve them. Follow-up efforts benchmark agents on repository-conditioned code synthesis (Tang et al., 2024), scale up benchmarking by admitting smaller codebases with LLM-generated unit tests (Jain et al., 2024b), and introduce continually updating pipelines for the task (Zhang et al., 2025). Such extensions provide valuable insights into LLM agent behavior, yet they ground their evaluations in correctness tests, which present a discrete optimization surface to the agents. FORMULACODE complements these benchmarks by assessing agents on community-maintained evaluation functions that present a smoother optimization landscape and higher fidelity than unit tests.

Optimization Benchmarks. There are prior benchmarks for efficient code synthesis on function- and file-level tasks. COFFE (Peng et al., 2025) samples tasks from HumanEval, MBPP, CodeContests, and APPS (Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021) and auto-generates stress tests, while ECCO (Waghjale et al., 2024) curates a function- and file-level efficient-synthesis dataset from IBM CodeNet (Puri et al., 2021) with data-mined test cases. Recent repository-level benchmarks like GSO-Bench (Shetty et al., 2025) and SWEfficiency (Ma et al., 2025) also study LLM agents' ability to optimize code. However, these benchmarks only optimize for a single target function at a time, and SWE-Perf (He et al.
, 2025) does not test correctness. In contrast, FORMULACODE focuses on: (1) using community-maintained benchmarks specifically designed to profile performance inefficiencies instead of hand-curated stress tests, (2) benchmarking on repository-level codebases, which better capture the natural challenges of real-world code optimization, and (3) presenting multiple workloads that can compete with one another to assess the holistic optimization ability of agents.

5. Conclusion

We present FORMULACODE, a comprehensive coding benchmark for repository-level agentic optimization. In this benchmark, coding agents must not only write code that passes standard correctness tests but also improve runtime. Our benchmark design enables us to study the impact of repository popularity, temporal cutoffs, and multi-scale optimization to guide the design of future agents capable of surpassing human experts. As code-writing agents become more capable at the repository level, FORMULACODE provides a rigorous foundation for development. To ensure longevity and prevent saturation, we operate as a live benchmark, continually ingesting new tasks to test agents against an evolving human baseline. Our evaluations show that FORMULACODE is a challenging benchmark for frontier LLMs and agentic frameworks, leaving open significant room for future agent development.

6. Acknowledgements

This work was supported in part by a Laude Institute Slingshot Award, NSF awards III-#2505097, PPoSS-#2316161, NSF #2505096, NSF #2505098, and gifts from Point72 and OpenAI. We also thank Alex Shaw, Braden Hancock, Miles Cranmer, Neehar Kondapaneni, Rogério Guimarães, Anant Asthana, and Markus Marks for helpful discussions.

7. Impact Statement

We have presented FORMULACODE: a benchmark for measuring the capabilities of LLM-guided agents to optimize performance on large codebases.
FORMULACODE is designed to serve two audiences: researchers (those developing new LLMs/agents) and practitioners (those using agents in daily workflows). For researchers, we hope that FORMULACODE accelerates the development of coding agents by providing contamination-free training and evaluation signals. For practitioners, we hope FORMULACODE offers comparative metrics that gauge the utility of LLMs and agents in specialized repositories under diverse cost-performance constraints. In this section, we discuss the broader societal impacts and ethical considerations of our work.

Potential for Misuse. Benchmark results are only as reliable as the interpretations drawn from them. To ground evaluations in realistic developer workflows, we use community-maintained workloads that already exist in each repository and attempt to preserve the same information and performance instrumentation available to a human contributor. This design also supports practical impact: strong model-generated changes can, in principle, be merged upstream to reduce maintenance burden, particularly for smaller repositories after thorough manual analysis. At the same time, reliance on repository workloads introduces an attack surface: an adversary could submit pull requests that alter or add workloads to make tasks artificially easier. While such additional workloads can increase regression coverage (thereby providing some downstream utility), practitioners should treat workload provenance and review practices as part of the evaluation's trust boundary.

Privacy Concerns. FORMULACODE is an 'open-book' benchmark and necessarily includes interactions from open-source software developers. We include such context to provide models access to the same information a human would use when solving these tasks.
Although we anonymize usernames and remove personally identifiable information to the best of our ability, some contributors may remain indirectly identifiable via secondary cues (e.g., writing style, repeated project-specific references).

Bias and Fairness. Benchmarks can incentivize and influence which capabilities are prioritized by the community. We strive to make FORMULACODE's metrics explicit and stable, and we apply statistical analyses to reduce unintended measurement artifacts. Yet FORMULACODE inherits limitations from the underlying repository benchmarks. In particular, FORMULACODE is susceptible to a form of the Quantitative Fallacy: aspects of agent competence that are difficult to measure may be underweighted or omitted, inflating the apparent utility of such algorithms. This is a limitation of all execution-based benchmarks. We therefore recommend using FORMULACODE as a complementary signal rather than as a substitute for careful manual assessment of agent/LLM behavior.

References

AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Amazon Web Services. Infrastructure security in Amazon EC2: Isolation on physical hosts. Amazon EC2 User Guide. URL https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/infrastructure-security.html#physical-isolation. Accessed: 2026-01-28.

Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, March 2024. Model card v1.0. Accessed 31 May 2025.

Anthropic. System card: Claude Opus 4 & Claude Sonnet 4. PDF, May 2025. URL https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf. Includes changelog updates dated July 16, 2025 and September 2, 2025.

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C.
Program synthesis with large language models, 2021.

Balsamo, S., Di Marco, A., Inverardi, P., and Simeoni, M. Model-based performance prediction in software development: A survey. IEEE Transactions on Software Engineering, 30(5):295–310, 2004.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., et al. Evaluating large language models trained on code, 2021.

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.

Cruz, V. P. G., Rocha, H., and Valente, M. T. Snapshot testing in practice: Benefits and drawbacks. Journal of Systems and Software, 204:111797, 2023.

Droettboom, M., Virtanen, P., and asv Developers. airspeed velocity (asv): A simple Python benchmarking tool with web-based reporting. https://github.com/airspeed-velocity/asv, 2025. GitHub repository, version v0.6.5, accessed 2026-02-24.

Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R. Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.

Forum Discussion. Four kinds of optimisation (Hacker News discussion). Hacker News, November 2023. URL https://news.ycombinator.com/item?id=38262251.

Forum Discussion. The fifth kind of optimisation (Hacker News discussion). Hacker News, April 2025. URL https://news.ycombinator.com/item?id=43555311.
GitHub and Google Cloud Platform. bigquery-public-data.github_repos – GitHub public repository dataset. https://console.cloud.google.com/marketplace/details/github/github-repos, 2025. Queried via Google BigQuery on 30 May 2025.

Grayeli, A., Sehgal, A., Costilla Reyes, O., Cranmer, M., and Chaudhuri, S. Symbolic regression with a learned concept library. Advances in Neural Information Processing Systems, 37:44678–44709, 2024.

Gu, A., Rozière, B., Leather, H., Solar-Lezama, A., Synnaeve, G., and Wang, S. I. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024.

He, X., Liu, Q., Du, M., Yan, L., Fan, Z., Huang, Y., Yuan, Z., and Ma, Z. SWE-Perf: Can language models optimize code performance on real-world repositories?, 2025. URL https://arxiv.org/abs/2507.12415.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with APPS, 2021.

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024a.

Jain, N., Shetty, M., Zhang, T., Han, K., Sen, K., and Stoica, I. R2E: Turning any GitHub repository into a programming agent environment. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 21196–21224. PMLR, 21–27 Jul 2024b. URL https://proceedings.mlr.press/v235/jain24c.html.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770.
Jin, G., Song, L., Shi, X., Scherpelz, J., and Lu, S. Understanding and detecting real-world performance bugs. ACM SIGPLAN Notices, 47(6):77–88, 2012.

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., and Potts, C. DSPy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714.

Koch, B., Denton, E., Hanna, A., and Foster, J. G. Reduced, reused and recycled: The life of a dataset in machine learning research. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=zNQBIBKJRkd.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L., tau Yih, S. W., Fried, D., Wang, S., and Yu, T. DS-1000: A natural and reliable benchmark for data science code generation, 2022.

LangDB. Qwen3-coder-480b-a35b-instruct by FireworksAI. Web page, July 2025. URL https://langdb.ai/app/models/fireworksai/qwen3-coder-480b-a35b-instruct. Model details, pricing, and performance metrics.

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.

Ma, J. J., Hashemi, M., Yazdanbakhsh, A., Swersky, K., Press, O., Li, E., Reddi, V. J., and Ranganathan, P. SWE-fficiency: Can language models optimize real-world repositories on real workloads?, 2025. URL https://arxiv.org/abs/2511.06090.

Mankowitz, D.
J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257–263, 2023.

Mann, H. B. and Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pp. 50–60, 1947.

Merrill, M. and Shaw, A. Terminus. https://www.tbench.ai/terminus, May 2025. Published May 19, 2025. Accessed 2026-01-28.

Merrill, M. A., Shaw, A. G., Carlini, N., Li, B., Raj, H., Bercovich, I., Shi, L., Shin, J. Y., Walshe, T., Buchanan, E. K., et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026.

Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. OctoPack: Instruction tuning code large language models, 2023.

Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., and Balog, M. AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind White Paper, May 2025.

OpenAI, Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., et al. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925.

Peng, Y., Wan, J., Li, Y., and Ren, X. COFFE: A code efficiency benchmark for code generation, 2025. URL https://arxiv.org/abs/2502.02827.

Puri, R., Kung, D. S., Janssen, G., Zhang, W., Domeniconi, G., Zolotov, V., Dolby, J., Chen, J., Choudhury, M., Decker, L., Thost, V., Buratti, L., Pujar, S., Ramji, S., Finkler, U., Malaika, S., and Reiss, F. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks, 2021.

Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M. P., Dupont, E., Ruiz, F. J., Ellenberg, J. S., Wang, P., Fawzi, O., et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.

Sasnauskas, R., Chen, Y., Collingbourne, P., Ketema, J., Lup, G., Taneja, J., and Regehr, J. Souper: A synthesizing superoptimizer, 2018. URL https://arxiv.org/abs/1711.04422.

Schkufza, E., Sharma, R., and Aiken, A. Stochastic superoptimization. In ACM SIGARCH Computer Architecture News, volume 41, pp. 305–316. ACM, 2013.

Sharma, A. OpenEvolve: Open-source implementation of AlphaEvolve. https://github.com/codelion/openevolve, 2025. Software, version 1.0.0.

Shetty, M., Jain, N., Liu, J., Kethanaboyina, V., Sen, K., and Stoica, I. GSO: Challenging software optimization tasks for evaluating SWE-agents, 2025. URL https://arxiv.org/abs/2505.23671.

Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366.
Singh, A., Fry, A., Perelman, A., Tart, A., et al. OpenAI GPT-5 system card, 2025. URL https://arxiv.org/abs/2601.03267.

Tang, X., Liu, Y., Cai, Z., Shao, Y., Lu, J., Zhang, Y., Deng, Z., Hu, H., An, K., Huang, R., Si, S., Chen, S., Zhao, H., Chen, L., Wang, Y., Liu, T., Jiang, Z., Chang, B., Fang, Y., Qin, Y., Zhou, W., Zhao, Y., Cohan, A., and Gerstein, M. ML-Bench: Evaluating large language models and agents for machine learning tasks on repository-level code, 2024.

Tideman, T. N. Independence of clones as a criterion for voting rules. Social Choice and Welfare, 4(3):185–206, 1987. ISSN 01761714, 1432217X. URL http://www.jstor.org/stable/41105866.

Tratt, L. Four kinds of optimisation, November 2023. URL https://tratt.net/laurie/blog/2023/four_kinds_of_optimisation.html.

Tratt, L. The fifth kind of optimisation, April 2025. URL https://tratt.net/laurie/blog/2025/the_fifth_kind_of_optimisation.html.

Waghjale, S., Veerendranath, V., Wang, Z., and Fried, D. ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness? In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15362–15376, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.859. URL https://aclanthology.org/2024.emnlp-main.859/.

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H., and Neubig, G. OpenHands: An open platform for AI software developers as generalist agents, 2025. URL https://arxiv.org/abs/2407.16741.

Woodside, M., Franks, G., and Petriu, D. C. The future of software performance engineering. In Future of Software Engineering (FOSE'07), pp. 171–187. IEEE, 2007.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Yang, J., Prabhakar, A., Narasimhan, K., and Yao, S. InterCode: Standardizing and benchmarking interactive coding with execution feedback, 2023.

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024.

Yao, S. Language Agents: From Next-Token Prediction to Digital Automation. PhD thesis, Princeton University, 2024.

Yin, P., Li, W.-D., Xiao, K., Rao, A., Wen, Y., Shi, K., Howland, J., Bailey, P., Catasta, M., Michalewski, H., Polozov, A., and Sutton, C. Natural language to code generation in interactive data science notebooks, 2022.

Zan, D., Chen, B., Yang, D., Lin, Z., Kim, M., Guan, B., Wang, Y., Chen, W., and Lou, J.-G. CERT: Continual pre-training on sketches for library-oriented code generation, 2022.

Zhang, L., He, S., Zhang, C., Kang, Y., Li, B., Xie, C., Wang, J., Wang, M., Huang, Y., Fu, S., Nallipogu, E., Lin, Q., Dang, Y., Rajmohan, S., and Zhang, D. SWE-bench goes live!, 2025.

Zhao, W., Jiang, N., Lee, C., Chiu, J. T., Cardie, C., Gallé, M., and Rush, A. M. Commit0: Library generation from scratch, 2024. URL https://arxiv.org/abs/2412.01769.

A.
FORMULACODE: Dataset Construction

FORMULACODE consists of 957 multi-workload, real-world code optimization problems from 70 repositories as of November 30th, 2025. We develop an automated four-stage pipeline that extracts these problems from 105074 pull requests across 766 repositories on GitHub, as described in §A.1 and illustrated in Figure 2. §A.2 summarizes the key properties of the dataset. At the time of collection, all frontier models tested on FORMULACODE struggle to outperform human experts (§3), though we expect more advanced models to close this gap in the near future.

A.1. Dataset Creation

Overview. The dataset creation pipeline comprises four broad stages: (1) crawl GitHub repositories with high-quality, expert-defined performance workloads (§A.1.1); (2) use rule-based and LLM-based attribute filters to discard candidate pull requests whose primary intent was not performance related (§A.1.2); (3) synthesize an environment-building script so that the terminal interface tools function (§A.1.3); (4) filter out all candidate PRs that do not show a statistically significant improvement in the performance workloads (§A.1.4).

A.1.1. STAGE 1: SCRAPING REPOSITORIES.

Our benchmarking apparatus relies heavily on mature tools developed within the Python performance benchmarking community (Appendix §B.2.1). To use these tools, the core developers of a package write customized performance profiling workloads in a pre-specified format for their repository. This allows us to identify crowdsourced workloads, as well as repositories with an established, rigorous benchmarking procedure, by searching for the presence of these tools. Appendix §B.1.1 provides additional details on the scraping process. Overall, this step yields 766 repositories.

A.1.2. STAGE 2: ATTRIBUTE FILTERING.
For each repository, we scrape pull requests that were merged into the default branch and that reference at least one issue. Next, we filter out all pull requests with missing patches or with unsatisfiable requirements (e.g., expired PyPI packages). This yields 26717 pull requests from 127 repositories. Finally, we construct a knowledge graph of relevant issues and comments referenced by the pull request, filtering out any nodes created after the PR creation date. The knowledge graph is rendered along with the merge commit patch and is analyzed by an LLM agent to gauge whether the primary intent of the pull request is performance oriented. This is required to reduce the cost of re-running all repositories. Specific details are presented in Appendix §B.1.2. This yields 3181 potential performance-improving tasks from 101 repositories, presented in Table 15.

A.1.3. STAGE 3: SYNTHESIZING REPRODUCIBLE ENVIRONMENTS.

Before we validate that the performance improvement claimed by the previous stage surfaces as a statistically significant improvement in the workloads, we must build and install a development copy of the package. However, automatically building such development copies proves to be a non-trivial task for three reasons. (1) Many scientific packages require complex tool interactions, which necessitate a bespoke build process. (2) The build process evolves significantly as a project matures. (3) The documentation for building packages tends to be extremely fragmented, requiring the reading of many plaintext and code files (README.md, setup.py, CONTRIBUTING.md, etc.) to reproduce. We automate the process of building such packages by developing a reflexive LLM agent (Shinn et al., 2023) that iteratively refines a shell script to build an editable environment for our benchmarking and testing apparatus. In the worst case, such an agent must be run on every potential candidate PR.
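The build-repair loop above can be sketched as follows. `llm_refine_script` and `run_build` are hypothetical stand-ins for the LLM call and for executing the candidate shell script inside the container; the actual agent's prompts and tooling differ.

```python
# Sketch of a Reflexion-style build-repair loop (Shinn et al., 2023):
# try a generic build script, and on failure let the agent rewrite the
# script based on the build log, up to a fixed iteration budget.
def synthesize_build_script(repo, llm_refine_script, run_build, max_iters=5):
    script = "pip install -e ."            # generic first attempt
    for _ in range(max_iters):
        ok, log = run_build(repo, script)  # returns (success, build log)
        if ok:
            return script                  # editable environment built
        script = llm_refine_script(script, log)  # reflect on the failure log
    return None                            # give up; the candidate PR is dropped
```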
However, we find that aggressively caching and reusing previous scripts significantly lowers the amortized complexity of LLM queries (Figure 14). More details are presented in Appendix §B.1.5. This process yields 1232 potential tasks with reproducible Docker containers from 75 repositories.

A.1.4. STAGE 4: STATISTICAL AND CORRECTNESS VALIDATION

Given a reproducible build environment, we can apply the expert-produced patch and ensure it produces a statistically significant speedup (Appendix §B.1.6). We offer two kinds of correctness tests in each FORMULACODE testing suite:

Unit Tests. Like contemporary work in building repository-centric code generation datasets (Jimenez et al., 2024), we find that the unit test suite needs to be manually validated to ensure proper operability. As such, in FORMULACODE-V, we present 108 problems where we manually synthesize and verify that the build process and test suite function properly.

Table 6: (Micro-averaged) statistics characterizing different attributes of a FORMULACODE task instance. The average FORMULACODE gold patch requires 5.2 more lines of code spread over 1.29× more files and 1.01× more functions than the average SWE-Bench (Jimenez et al., 2024) patch.

                                   Mean      Max
Issue Text Length (Tokens)         2718.03   15781
Gold Patch: # Lines edited         38.088    526
Gold Patch: # Files edited         3.93      34
Gold Patch: # Func. edited         6.06      54
Workloads: # Eval. Fns             264.58    1364
Workloads: % Coverage              41.24%    97.86%

Snapshot Tests. After benchmarking performance workloads, we capture a snapshot of the immediate return values of the workloads (by execution trace inspection). We then compare it against a reference snapshot captured after the human-written code was benchmarked. Comparison is skipped for any Python objects where equality is not defined.
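The core of this comparison can be sketched as below: a workload's return value is checked against the reference, and the check is skipped for objects whose type never overrides equality, since the default identity comparison would be meaningless across processes. This is illustrative only; the actual trace-based capture in FORMULACODE differs.

```python
# Sketch of snapshot comparison with the "equality not defined" skip rule.
def has_defined_equality(obj) -> bool:
    # object.__eq__ falls back to identity, which cannot match across runs
    return type(obj).__eq__ is not object.__eq__

def snapshot_matches(candidate, reference) -> bool:
    if not (has_defined_equality(candidate) and has_defined_equality(reference)):
        return True          # equality undefined: skip this workload's check
    return candidate == reference
```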
Such snapshot tests are commonly used in UX development to ensure that an underlying hard-to-inspect system (Android's View Hierarchy, the HTML DOM, or, in our case, an arbitrary Python package) does not change unexpectedly following codebase changes (Cruz et al., 2023). This snapshot testing framework allows us to construct correctness checks for all performance workloads, greatly increasing the correctness verification surface of each task. This process yields the 957 statistically significant performance improvement tasks that form FORMULACODE. Table 16 shows a repository-level breakdown of the final dataset. The next section presents a deeper analysis of the dataset.

A.2. FORMULACODE Analysis

Multi-Workload Optimization Tasks. Code optimizations rarely have isolated effects; an optimization in one part of the code could significantly slow down another part or cause unwanted spikes in other resources (e.g., in some scenarios, memoization-based optimizations are undesirable because they decrease runtime at the cost of increased memory usage). FORMULACODE handles this problem by framing performance optimization as a multi-workload optimization problem. Each FORMULACODE problem has, on average, 264.58 performance workloads that are presented to the optimization agent along with the problem description. The agent is evaluated on the aggregate performance improvement it achieves across all workloads. To perform well on FORMULACODE, the agent must reason about the effect its changes have on multiple workloads spanning multiple target functionalities or multiple target resources.

Task Diversity. The general consensus in repository-centered dataset design is to restrict scraping problems to a curated set of repositories.
While manual curation significantly eases dataset construction, it inadvertently creates a cumulative advantage for certain types of repositories and their respective tasks. As explored in (Koch et al., 2021), this Matthew effect ultimately leads to benchmarks becoming disconnected from the broader task distribution and hurts "in the wild" performance. Instead, FORMULACODE samples performance benchmarks from a large set of repositories based on whether the performance benchmarks adhere to the four axiomatic stages. Figure 6 showcases the set of repositories represented in FORMULACODE and Table 16 presents a more detailed overview.

Contamination Resistance. Data contamination has been shown to skew the performance of frontier models on many code-generation tasks mined from GitHub (Zhang et al., 2025). To be resistant to such contamination, FORMULACODE functions as a live dataset: we update FORMULACODE's problem set on the 31st of each month with new problems. Figure 15 showcases the distribution of FORMULACODE problems by the merge date of the task. The earliest task was merged on 2017-10-21 and the most recent task is from 2025-11-21. 55.88% of the tasks were merged in the last two years, and we added, on average, 27.00 problems to the dataset every month in 2025.

Hierarchical Workloads. Based on the file structure of the benchmarks directory, we organize all workloads at three levels of increasing granularity: module, class, and function. As depicted in Figure 22, this allows us to aggregate workloads in our analysis based on the semantic grouping assigned by core developers.

Dataset Composition. Table 15 shows the composition of FORMULACODE across different filtering stages.
In § B.1.7, we further characterize FORMULACODE using an automated taxonomy-based classifier that infers (i) the type of optimization problem and (ii) problem difficulty; the resulting distributions are reported in Tables 7 and 8. We find that, in FORMULACODE, roughly three categories account for ∼60% of problems (Micro Optimizations, Remove/Reduce Work, and Construct Better Algorithms), as shown in Table 7. Also, most expert solutions are inferred to be of Easy or Medium difficulty (Table 8). These distributions change only marginally in FORMULACODE-V.

Sample Questions. Appendix § B.2.4 showcases example questions from FORMULACODE.

B. Additional Details

This Appendix presents more details on the following topics:

§ B.1: Dataset Construction. This includes subsections on (§ B.1.1) scraping repositories and the compliant repositories discovered, (§ B.1.2) details on attribute filtering and repository-level composition after attribute filtering, (§ B.1.5) Docker container synthesis, and (§ B.1.6) statistical testing.

§ B.2: Experiments. This section provides additional details on (§ B.2.1) the benchmarking framework used in FORMULACODE, (§ B.2.2) the agent-model configurations presented, (§ B.2.3) the taxonomy used for classifying the types of optimizations, (§ B.2.4) qualitative examples showing characteristic behavior of various agent-model pairs, (§ B.2.5) the evaluation framework used in FORMULACODE, and (§ B.3) additional analysis of FORMULACODE.

B.1. Dataset Construction Details

In this section, we provide details on the dataset construction process. Our core aim is to provide an automated pipeline for constructing a dataset of pull requests that are relevant for performance benchmarking.
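At a high level, the construction pipeline is a sequence of narrowing filters (scraping, rule-based filtering, intent filtering, problem-statement construction, environment synthesis, statistical validation). A minimal sketch of such a staged pipeline follows; stage names and record fields are illustrative, not the actual implementation:

```python
def run_pipeline(pull_requests, stages):
    """Apply filter stages in order; each stage is a predicate that keeps
    or drops a pull request. Returns survivors plus per-stage counts,
    loosely mirroring the staged construction in Appendix B.1.
    """
    surviving = list(pull_requests)
    counts = {"scraped": len(surviving)}
    for name, keep in stages:
        surviving = [pr for pr in surviving if keep(pr)]
        counts[name] = len(surviving)  # track attrition at each stage
    return surviving, counts
```

With stub predicates, one can check how each stage narrows the candidate pool, which is useful when tuning filter thresholds.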
The dataset was constructed on a single machine running Ubuntu 22.04 LTS with 503 GiB RAM, a dual-socket Intel Xeon Platinum 8352Y CPU @ 2.20 GHz (128 hardware threads), and 4× NVIDIA A40 GPUs (46 GiB VRAM each). Making the dataset from scratch takes ∼32 hours, consuming ∼100 GB of disk space for the metadata and ∼2 TB of disk space for the Docker image cache.

We use two LLMs during the dataset construction process. For less complex tasks such as textual classification and extraction, we use the openai/gpt-oss-120b model served locally (Kwon et al., 2023; OpenAI et al., 2025). For complex tasks such as environment build-script synthesis, we first attempt to use the local LLM and fall back to the anthropic/claude-3-5-sonnet-20241022 (Anthropic, 2024) model (with a one-time total cost of $446 for the entire dataset). The additional cost may change if a different locally available LLM is utilized.

B.1.1. Repository Scraping

We identify compliant repositories by searching for the presence of mature tools developed within the Python performance-benchmarking community. To search for these repositories at scale, we develop a CommonSQL script to search for the presence of performance-oriented tools and workloads in the GitHub Public Dataset on Google BigQuery (GitHub & Google Cloud Platform, 2025), which snapshots about 2.8 × 10^6 open-source repositories and 2 × 10^9 code files. We add additional filters to ensure only mature software packages are considered. Specifically, we ensure that each valid repository (1) has markers identifying the presence of at least one performance workload (e.g., asv.conf.json); (2) does not fork an existing repository; (3) shows PR merges and active maintenance in the last three years; and (4) supports Python 3.8+. This leaves us with 766 repositories. The CommonSQL script executes in about 48 seconds and costs $9.4.
As an alternative, we can also use the GitHub Search API to query for the repositories. This yields the same number of repositories, but can be much slower due to API rate limits.

B.1.2. Rule-Based Filtering

Once we have a list of compliant repositories, it is technically possible to execute and measure the performance of every pull request in each repository. However, as most pull requests do not primarily intend to improve performance, this leads to unnecessary waste of compute resources. The rule-based filtering stage ensures that we collect performance metrics only for pull requests that we can ensure are suitable for benchmarking. Most filters in this stage aim to identify unambiguous signals that disqualify a pull request from being used for benchmarking. The prominent filters are listed below:

• Repository Compliance: We select repositories that have at least 100 GitHub stars. Below 100 stars, we found that repositories often lacked the necessary community engagement to produce good-quality pull requests.

• Pull Request Status: We strictly filter for pull requests that have been successfully merged ( state=’closed’ with a valid merged_at timestamp) within the target date range. We also ensure that we can retrieve and successfully apply the patch to the repository.

• Benchmarking Infrastructure: The specific commit tree must contain an Airspeed Velocity (ASV) configuration file ( asv.conf.json ), ensuring the repository supported benchmarking at that point in history.

• Core Content: We explicitly exclude commits that only touch non-functional paths, such as tests/ , docs/ , examples/ , .github/ , dist-info/ , build artifacts, or packaging metadata (e.g., pyproject.toml , requirements.txt ).

• Heuristic Message Filtering: We apply a regex-based pre-filter to the commit message.
Commits matching “negative” patterns (e.g., “revert”, “release”, “bump version”, “fix typo”, “formatting”) are discarded unless they also contain “positive” performance keywords (e.g., “speed”, “optimize”, “latency”, “throughput”, “memory”, “vectorize”). Ambiguous messages are retained for LLM classification.

• Complexity Constraints: To ensure feasibility for both the LLM context and the build system, we exclude commits that change more than 500 files or 80,000 lines of code, or where the patch size exceeds an acceptable context window for a capable local LLM (64,000 tokens). These constraints can be adjusted based on the future capabilities of LLMs.

• Build Environment: We clone each repository at the specific commit tree and attempt to build it using uv. uv is a fast Python package manager that can install dependencies from a project’s dependency files (e.g., pyproject.toml , requirements.txt , or setup.py ). If the build fails, we discard the pull request. If the build succeeds, we pin the dependencies to ensure that the build environment can be reproduced. This is a compute-intensive process and, after parallelizing the builds, requires ∼13 hours for all pull requests on our machine.

After applying these filters, we are able to select 26717 pull requests from 127 repositories that are suitable for benchmarking.

B.1.3. Performance Intent Filtering

The previous stage ensures that we only select pull requests that are suitable for benchmarking. However, it is still possible that a pull request does not primarily intend to improve performance. To filter these out, we utilize a pre-trained local LLM to classify each pull request as performance-improving or not. The primary objective of this classifier is to filter out pull requests that pass the regex-based heuristic but are not bona fide performance optimizations.
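The regex-based pre-filter from the rule-based filtering stage can be sketched as follows. The keyword lists are paraphrased from the description above; the actual patterns used by FORMULACODE may differ:

```python
import re

# Illustrative keyword lists; FORMULACODE's exact patterns may differ.
NEGATIVE = re.compile(
    r"\b(revert|release|bump version|fix typo|formatting)\b", re.IGNORECASE)
POSITIVE = re.compile(
    r"\b(speed|optimi[sz]e|latency|throughput|memory|vectori[sz]e)\b",
    re.IGNORECASE)

def keep_for_llm(message):
    """Discard a commit matching a negative pattern unless it also contains
    a positive performance keyword; retain everything ambiguous so the LLM
    classifier can decide."""
    if NEGATIVE.search(message) and not POSITIVE.search(message):
        return False
    return True
```

Note the asymmetry: the pre-filter only rejects unambiguous non-performance commits, consistent with the recall-over-precision design of the downstream classifier.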
Common examples of such false positives include commits that contribute new features instead of improving performance, refactor code structure without runtime impact, or make maintainability improvements. The classifier analyzes the pull request description, file-change summary, and the code patch to make this determination. The classifier is written in DSPy (Khattab et al., 2023) and the prompt is shown in Figure 13. We explicitly prioritize recall over precision: the prompt is configured to lean towards a “YES” classification in ambiguous cases. This design choice is deliberate, as false positives will be symbolically verified in the subsequent benchmark-execution stage and discarded if they yield no measurable speedup.

B.1.4. Problem Statement Construction

To transform a raw pull request into a benchmark task, we must construct a clear, self-contained problem statement that defines the performance goal. We employ a multi-stage pipeline to aggregate context and extract a structured narrative.

Context Aggregation. For each candidate pull request, we scrape all available metadata (title, body, labels, comments, date of creation, and date of merge) that can be used to construct the problem statement. We also fetch the file-change summary and the raw patch content to ground the problem statement in the actual code changes. We parse the pull request body and comments to identify linked issues (e.g., #123 , owner/repo#123 ).
[Figure 6 chart: task counts per repository, led by pandas-dev/pandas (222), scikit-learn/scikit-learn (143), qiskit/qiskit (142), xdslproject/xdsl (134), optuna/optuna (94), pydata/xarray (69), sk-image (39), networkx (35), satpy (30), followed by a long tail of repositories with 18 or fewer tasks each.]

Figure 6: Distribution of tasks across repositories in FORMULACODE until November 2025. FORMULACODE comprises 957 tasks sampled from 70 diverse open-source GitHub repositories. Most repositories are software tools used extensively within scientific communities. FORMULACODE shows a strong long-tail pattern of bespoke repositories that are rarely covered in contemporary code-generation datasets. Table 16 presents a detailed overview.

These references are resolved to their full issue descriptions and discussions, which are also parsed and aggregated into the problem statement. We only include information that was available before or at the time the pull request was created to ensure that the problem statement is self-contained.

Context Filtering. Before attempting extraction, we enforce a strict validity check: a pull request must have at least one linked issue or a descriptive body. The rationale for this constraint is twofold. First, the linked issue typically provides the problem context (the bug report, performance-regression analysis, or feature request) that motivated the change.
Second, a descriptive pull request provides details of the problem solved, the methodology used, and the solution, which are helpful both for computing metadata for the benchmark task and for clarifying the overall task goal.

Context Extraction. We consolidate all linked issues into a single document using a static template (shown in Figure 8). In principle, the issue text alone should sufficiently describe the initially observed performance regression or bottleneck. However, in practice, we find that while an issue provides the initially observed regression or bottleneck, issues frequently bundle multiple optimization directions that are implemented across several pull requests. As a result, a problem statement derived only from the issue can under-specify the starting state, leading to an ambiguous task (an agent may optimize a different aspect than the original change). To ensure that each problem statement provides a clear and self-contained description of the problem, we use another specialized LLM-based classifier to extract relevant problem context from the pull request description. We instruct the agent to extract near-verbatim sentences corresponding to the performance goal and the constraints relevant to this PR. Each extracted sentence is symbolically verified to maintain a high degree of textual fidelity (a high Longest Common Subsequence ratio) to preserve technical terms, error messages, and code snippets. Any pull request that fails to yield a valid problem context is discarded, as it lacks a defined starting state for the benchmark. This LLM-based extraction agent is implemented using DSPy, and the prompt is shown in Figure 10.

Examples. Figure 8 shows problem statements for some FORMULACODE tasks.
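The near-verbatim fidelity check in the Context Extraction step above can be sketched as a token-level Longest Common Subsequence ratio. This is a simplified stand-in for the actual verification; the threshold and tokenization are illustrative:

```python
def lcs_ratio(extracted, source):
    """Fraction of the extracted sentence's tokens that appear, in order,
    in the source text (classic O(n*m) LCS dynamic program)."""
    a, b = extracted.split(), source.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ta == tb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)] / max(1, len(a))

def is_faithful(extracted, source, threshold=0.9):
    """Accept an extracted sentence only if it is near-verbatim."""
    return lcs_ratio(extracted, source) >= threshold
```

A sentence the LLM paraphrased too freely scores a low ratio and is rejected, preserving technical terms and code snippets verbatim.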
Each problem statement has an initial set of static instructions, information about the problem extracted from the linked issues, and the initial direction of optimization extracted from the pull request description. The problem statement construction (§ B.1.4) and performance intent filtering (§ B.1.3) stages are applied together to yield 3181 problems.

B.1.5. Synthesizing Reproducible Environments

Motivation. A critical challenge in benchmarking historical commits is that the build environment (dependencies, compilers, and system libraries) is often implicit and evolves over time. Simply installing the package via pip is insufficient for performance benchmarking for two main reasons. First, performance-critical Python packages often rely on compiled extensions (C/C++, Cython, Fortran) that must be built from source to accurately reflect the performance characteristics of the code at that specific commit; installing pre-built binaries (wheels) would benchmark the packaged version rather than the code in the pull request. Second, developers often introduce bespoke dependencies or modify build configurations in a pull request, rendering previous environments obsolete. To address this, we implement an agentic pipeline to synthesize a reproducible Docker environment for each task.

Setup. For each task, we first construct a Docker container with the base dependencies installed (refer to the ‘Build Environment’ subsection under § B.1.2) containing the source code of the repository at the initial state of the pull request. Our goal is to synthesize a build script that contains shell commands to install an editable version of the package from source. We also want to ensure that certain tools (ASV, PyTest, and our snapshot-testing tool) can be successfully run in the container.

Agent. We employ an iterative, reflexive agent to synthesize a valid build script.
The agent is described in Figure 10 and has four principal components:

Validation & Feedback Loop: The synthesized script is executed in an isolated Docker container. We validate the build using two verification subroutines. (1) A profile check ensures that the package is importable and runnable, and that we can run the ASV benchmarks under a generous timeout. (2) A pytest check ensures that we can run the pytest test suite without errors. If the build or validation fails, the stderr and stdout logs are fed back to the agent as observations, allowing it to iteratively refine the script (e.g., installing missing system libraries, fixing syntax errors).

Chronological Retrieval: We leverage the insight that build requirements rarely change drastically between adjacent commits. For a given task, we sample 10 successful build scripts from the same repository, sourced from a database of successfully built tasks and sorted by commit date. We first attempt to build the container using the script from the nearest chronological neighbor. If the build or verification fails, we move to the next neighbor until we either find a successful build or run out of neighbors. The failure logs are preserved and used as observations for the agent.

Agentic Synthesis: If the retrieved scripts fail (or no history exists), we instantiate an LLM-based agent to generate a new build script. The agent acts as an interactive planner with access to the failure logs and a set of tools that allow it to inspect the repository state (e.g., list directories, read files, parse setup.py or pyproject.toml ). Given 10 interactive turns, the model can either use one of the tools or end its turns early by synthesizing a build script.
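The Chronological Retrieval component above can be sketched as a nearest-neighbor search over cached build scripts. Dates, the cache format, and helper names here are illustrative, not the actual implementation:

```python
from datetime import date

def nearest_neighbor_scripts(task_date, cache, k=10):
    """Return up to k cached (commit_date, script) pairs ordered by
    distance in time from the task's commit date, nearest first."""
    return sorted(cache, key=lambda entry: abs((entry[0] - task_date).days))[:k]

def build_with_retries(task_date, cache, try_build):
    """Attempt cached scripts nearest-first. Return (script, failure_logs)
    on the first success, or (None, failure_logs) so an LLM-based agent
    can take over with the accumulated logs as observations."""
    logs = []
    for _when, script in nearest_neighbor_scripts(task_date, cache):
        ok, log = try_build(script)
        if ok:
            return script, logs
        logs.append(log)
    return None, logs
```

The accumulated failure logs are exactly what the agentic-synthesis fallback consumes, so the two components compose naturally.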
The largest models we tried (Claude Sonnet 3.5 and GPT OSS 120b) rarely choose to use tools, as the error messages provide sufficient context, while the smallest model (Meta Llama 3.3 8B; (AI@Meta, 2024)) often utilizes many tool interactions before synthesizing the build script.

LLM Choice and Prompt Design. We find that a locally hosted openai/gpt-oss-120b provides the best balance of performance and cost. We also implement a fallback to anthropic/claude-3-5-sonnet-20241022 if the build-script synthesis fails after multiple tries. Overall, the chronological caching and local-LLM cascade allow us to successfully synthesize build scripts for 1232 tasks across 75 repositories at a cost of $446 to process 3181 PRs, yielding 1232 reproducible containers. We elected to stop the synthesis prematurely due to limited resources; with more resources, we expect the number of reproducible containers to increase substantially.

B.1.6. Statistical Testing and Robustness

Finally, we must ensure that every retained task reflects a statistically significant and reproducible performance change. Because timing measurements are inherently noisy (e.g., due to OS scheduling, background load, and CPU power management), we adopt the statistical-significance validation procedure used by ASV to verify that the observed differences between two code states are significant under repeated measurement on commodity hardware.

Measurement protocol. All experiments are run on an AWS EC2 instance specified in § B.2.5 to ensure hardware isolation. For each candidate pull request, we execute the expert-selected workloads $\mathrm{Workloads} = \{w_1, \ldots, w_n\}$ on both the baseline codebase $\mathrm{Code}_0$ and the human-optimized codebase $\mathrm{Code}^*_{\mathrm{expert}}$ on the same instance.
For each workload $w_i$, ASV repeatedly evaluates the benchmark under a warm-up and multi-sample timing protocol (with interleaved rounds when enabled), yielding independent sample sets of observed runtimes for the baseline and human-edited codebases:
$$X_i = \{x_{i1}, \ldots, x_{im}\} \text{ from } w_i(\mathrm{Code}_0), \qquad Y_i = \{y_{i1}, \ldots, y_{ik}\} \text{ from } w_i(\mathrm{Code}^*_{\mathrm{expert}}),$$
where $X_i$ and $Y_i$ denote the sets of measurements for workload $w_i$ from the baseline and human-edited code, respectively. We preserve ASV's default sampling parameters (unless a repository overrides them via workload-specific attributes), so that the resulting statistical decision procedure matches common practice in the Python benchmarking ecosystem.

Mann–Whitney U test. To test whether $\mathrm{Code}_0$ and $\mathrm{Code}^*_{\mathrm{expert}}$ exhibit different performance distributions for a workload, we use the Mann–Whitney U test (Mann & Whitney, 1947), a non-parametric two-sample test based on rank ordering. Formally, for samples $X_i$ and $Y_i$, the U statistic can be written as
$$U(X_i, Y_i) = \sum_{a=1}^{m} \sum_{b=1}^{k} \left( \mathbb{I}[x_{ia} > y_{ib}] + \tfrac{1}{2}\,\mathbb{I}[x_{ia} = y_{ib}] \right),$$
and the associated two-sided $p$-value quantifies evidence against the null hypothesis.

Null hypothesis. For each workload $w_i$, we test $H_0$: $X_i$ and $Y_i$ are drawn from the same underlying distribution (i.e., the patch does not induce a statistically detectable change in the benchmark outcome), against the two-sided alternative that the distributions differ. We only consider workloads that reject $H_0$.

Implementation. In practice, ASV applies a conservative two-stage decision rule. When sufficient raw samples are available, it applies the Mann–Whitney U test and declares a difference only if the resulting $p$-value is below a stringent threshold (default $p < 0.002$).
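The U statistic above can be computed directly from its definition. A minimal sketch follows; production code would typically use a library implementation (e.g., scipy.stats.mannwhitneyu), which also supplies the $p$-value:

```python
def mann_whitney_u(xs, ys):
    """U statistic as defined above: count the pairs where a baseline
    sample exceeds an optimized sample, with ties counted as 1/2."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

If every baseline runtime exceeds every optimized runtime, $U = m \cdot k$ (its maximum), which is the unambiguous-speedup case.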
If the sample sizes are too small for the U test to ever reach this threshold (given the discrete nature of the test), ASV falls back to a pessimistic check based on uncertainty estimates: it computes a 99% confidence interval for each sample distribution and only declares a difference when these intervals do not overlap. This fallback biases towards not claiming a difference unless the separation is unambiguous.

Dataset Inclusion Criterion. We discard candidate tasks for which no workload exhibits a statistically significant change between $\mathrm{Code}_0$ and $\mathrm{Code}^*_{\mathrm{expert}}$ under this rule. Tasks with no positive significant workloads are also discarded. This ensures that every retained task in FORMULACODE corresponds to a clear, reproducible, and statistically supported performance difference, and yields the final 957 problems used in FORMULACODE. The 108 problems in the FORMULACODE-V subset are sampled from the best-performing tasks in FORMULACODE.

B.1.7. Dataset Composition Statistics

To better study the characteristics of FORMULACODE, we develop an automated classifier that attempts to infer the kind of optimization based on a curated taxonomy (§ B.2.3). The classifier is similar to the one introduced in § B.1.3: it takes as input a sample pull request along with the expert-written patch and attempts to categorize the human-written solution using a manually curated taxonomy (Table 9). Such a methodology allows us to efficiently and scalably study the composition of a continuously growing set of problems. The prompts for this classifier are presented in Figure 9 and an example is presented in Figure 11. The distribution of optimization types is presented in Table 7 and the distribution of inferred difficulty is presented in Table 8.
Importantly, the distributions of optimization problems and difficulty change only marginally between FORMULACODE and FORMULACODE-V.

B.2. Experiment Details

In this section, we provide additional details on the methodology used to evaluate agents on FORMULACODE. All experiments ran on a single Ubuntu 22.04 LTS machine with 503 GiB RAM, an Intel Xeon Platinum 8352Y CPU @ 2.20 GHz (128 hardware threads), and 4 NVIDIA A40 GPUs (46 GiB VRAM each). Our evaluation protocol is grounded in Terminal Bench (Merrill et al., 2026). Unless explicitly indicated otherwise, all experiments use the default hyperparameters defined by Terminal Bench.

INPUT SIGNATURE

problem_description : string
    Problem statement and technical context from PR/issue.
git_patch : string
    Git diff showing actual code changes.
file_change_summary : string
    A markdown table summarizing all the files changed in the commit along with lines added/removed.

CLASSIFIER MODULE

Decide if this commit’s PRIMARY intent is to improve product/runtime performance.
Label YES only when there is CLEAR, EXPLICIT evidence in the description and/or patch that the runtime gets faster (e.g., algorithm change, fewer allocations, caching, vectorization, reduced I/O, async/non-blocking for throughput, latency reduction, memory footprint reduction, fix a speed regression).

Strong positive signals (weigh these collectively):
• PR title/body contains performance intent (e.g., “PERF:”, “speed up”, “faster”, “performance”).
• Linked issues/comments include benchmark links or timings demonstrating impact.
• Low-level/hot-path tweaks (e.g., reuse global context, avoid per-call init/teardown, vectorize C/NumPy).
Hard NO (non-performance) examples:
tests/ASV/harness-only changes; CI/workflows/build/packaging; coverage; pre-commit/format/lints (clippy/ruff/black); docs; version bumps; terminology/renames; pure refactors without performance claims; changes aimed at making perf tests pass but not improving runtime.

If ambiguous, weigh the concrete code changes and problem description together.
When there are specific performance cues (title keywords, measured timings, fewer allocations, vectorization, caching/reuse) lean YES; otherwise NO.

OUTPUT SIGNATURE

reasoning : string
    Deductive reasoning steps leading to the classification.
label : string
    Final label: “YES” for performance-related, “NO” otherwise.

Figure 7: Prompt template used by the LLM-based performance intent classifier described in § B.1.3. The prompt defines the input signature (problem description, git patch, and file-change summary), the classifier module specifying decision criteria for identifying performance-motivated commits, and the output signature producing a reasoning trace and binary label (“YES”/“NO”).

Example PR

CLASSIFIER INPUT

problem_description : string
    Labels: performance; Description: Fixes #14471.
    Body: The new ParameterExpression.bind_all is a fast path for producing a numeric result. This has advantages over ParameterExpression.bind:
    • Far fewer Python objects are allocated, since no new ParameterExpression objects need to be constructed and the output is guaranteed to be numeric.
    • There is no historical API requirement to scan the incoming mapping for invalid keys or values, yielding a large performance improvement when the same mapping is used to bind many expressions.
    • This provides a major complexity improvement when a large values dictionary is reused many times.
    There is still room for further gains because the Rust-space ParameterExpression and SymbolExpr interfaces require more heap allocations than strictly necessary, but this already yields substantial speedups.
    Issues: Fixes #14471.
    The linked issue reports that ParameterExpression.bind scales with the size of the binding dictionary even when only a single parameter is needed, leading to severe performance penalties for large parameter tables.
    Comments:
    Currently in draft because there’s no tests - I’m just putting it up so Sam and Ian from #14471 can test it out for their use case. For the explicit example in that issue, a complete comparison on my machine:
    Out of date timings
    In [1]: from qiskit.circuit import Parameter, ParameterExpression
    N: int = 100_000
    parameter_values = {Parameter(f"th_{i}"): 1 for i in range(N)}
    parameter_values[param := Parameter("my_param")] = 1
    . . .
    I think it’s fine without having the same behavior. For clarity it might be helpful to add a blurb to the bind_all docstring to say that “unlike bind, NaN and inf are in the range of expected outputs for this method”.
    LGTM, thanks!

git_patch : string
    diff --git a/crates/circuit/src/parameter/parameter_expression.rs b/crates/circuit/src/parameter/parameter_expression.rs
    index 1f0406f62c7e..98da2ee3e9e6 100644
    --- a/crates/circuit/src/parameter/parameter_expression.rs
    +++ b/crates/circuit/src/parameter/parameter_expression.rs
    @@ -1048,6 +1048,40 @@ impl PyParameterExpression {}
    + #[pyo3(name = "bind_all")]
    + #[pyo3(signature = (values, *))]
    + pub fn py_bind_all(&self, values: Bound) -> PyResult {}
    + let mut partial_map = HashMap::with_capacity(self.inner.name_map.len());
    + for symbol in self.inner.name_map.values() {}
    + let py_parameter = symbol.clone().into_pyobject(values.py())?;
    + partial_map.insert(symbol, values.get_item(py_parameter)?.extract()?);
    +
    + let bound = self.inner.expr.bind(&partial_map);
    ...

file_change_summary : string
    File | Lines Added | Lines Removed | Total Changes
    crates/circuit/src/parameter/parameter_expression.rs | 43 | 0 | 43
    crates/circuit/src/parameter/symbol_expr.rs | 1 | 1 | 2
    releasenotes/notes/parameter-bind-all-22337586429509f9.yaml | 8 | 0 | 8
    test/python/circuit/test_parameters.py | 18 | 0 | 18

CLASSIFIER MODULE

Decide if this commit’s PRIMARY intent is to improve product/runtime performance.
Label YES only when. . .
. . . there are specific performance cues (title keywords, measured timings, fewer allocations, vectorization, caching/reuse) lean YES; otherwise NO.
CLASSIFIER OUTPUT

reasoning : string
    The pull request introduces a new fast-path method ParameterExpression.bind_all (and Parameter.bind_all) to bind parameters to numeric values with far fewer allocations and reduced overhead than the existing bind method. The PR explicitly emphasizes performance, using terms such as “fast path” and “huge performance gain”, and reports timing improvements from tens of milliseconds to microsecond-scale execution.
    The implementation avoids constructing new ParameterExpression objects and skips expensive validation logic, directly producing numeric results. It is labeled with the performance tag and documented as a fast-path method for performance-sensitive use cases, making the primary intent of the change to reduce computational overhead and improve the runtime performance of parameter binding.
    Therefore, the commit’s primary purpose is to make parameter binding substantially faster.

label : string
    ’YES’

Figure 8: Example of performance intent classification for a real pull request (https://github.com/Qiskit/qiskit/pull/14782), illustrating how the prompt in Fig. 7 is instantiated and how the classifier produces a reasoning trace and binary label.

Performance type & difficulty classifier

INPUT SIGNATURE PROMPT

problem_description : string
    Problem statement and technical context from PR/issue.
git_patch : string
    Git diff showing actual code changes.

CLASSIFIER MODULE PROMPT

Decide the PRIMARY performance optimization technique and the difficulty level of the optimization.
Category mapping (when performance-related):
• Algorithm improvements: complexity reduction; switching to faster algorithms → use_better_algorithm
• Data structures / layout: sets, maps, indices; memory layout tuning → use_better_data_structure_and_layout
• System-level: C/Rust/NumPy/vectorized/native extensions → use_lower_level_system
• Approximation / heuristics: trade accuracy for speed → accept_less_precise_solution
• Parallelization: threads, processes, parallel algorithms (not just async I/O) → use_parallelization
• Cache & reuse: memoization, LRU, materialized results → cache_and_reuse
• Scheduling: batching, lazy execution, throttling → do_it_earlier_batch_throttle
• Database / storage: indices, query tuning, partitioning → database_and_storage_tuning
• Micro-optimizations: hot-path tweaks, guards, inlining → micro_optimizations
• I/O / latency hiding: async or non-blocking I/O, overlap I/O and compute → io_and_latency_hiding
• Higher-level systems: using optimized libraries or frameworks → use_higher_level_system
• Uncategorized: performance-related but does not fit the above categories → uncategorized

Difficulty (when performance-related):
• easy: localized change (< 50 lines), minimal risk
• medium: module-level refactor, data structure changes
• hard: algorithm rewrite or architectural change

OUTPUT SIGNATURE PROMPT

category : OptimizationType
    The classified optimization category.
difficulty : DifficultyLevel
    The difficulty level of the optimization.
reasoning : string
    Brief explanation of the classification.

Figure 9: Prompt template used by the LLM-based classifier for assigning each performance task an optimization category and difficulty level (§ B.2.3).
The prompt defines the input signature, a taxonomy-driven classification module that maps code changes to optimization types, and an output schema that produces the predicted category, difficulty, and a brief reasoning trace.

INPUT SIGNATURE

owner_repo : string
The repository this commit belongs to (e.g., scikit-learn/scikit-learn).

sha : string
The commit SHA that is currently checked out.

commit_date : string
The commit date in ISO format (e.g., 2023-10-05T12:34:56Z).

stderr_logs : string
Most recent stderr logs from the last build attempt (up to ∼8k tail-end characters).

stdout_logs : string
Most recent stdout logs from the last build attempt (up to ∼8k tail-end characters).

failure_more : string
Describes where the failure occurred (e.g., N/A, build failed, asv run failed).

last_docker_build_script : string
The previously generated docker_build.sh script.

repo_facts_json : string
JSON object containing inferred repository facts (paths, package names, versions, etc.).

toolbelt : string
Human-readable summary of available tools and their usage.

messages_log : string
Transcript of prior tool calls, actions, and observations.

BUILD AGENT MODULE

An interactive planner for producing a docker_build.sh bash script that builds and installs a Python repository inside micromamba environments. The agent may either: (A) request a tool call with structured JSON arguments, or (B) output the final executable build script.
If a tool is required, set next_action to one of: probe_repo | list_tree | read_file | try_import | none.

Tool call formats:
• read_file: {"path": "...", "max_bytes": 65536}
• list_tree: {"depth": 2}
• try_import: {"candidates": ["foo","bar"]}
Return docker_build_script only when fully satisfied with correctness and completeness.
Critical constraints on the generated script:
• Must be idempotent and safe to run inside Docker.
• Fully non-interactive; no user prompts.
• Must be valid executable Bash with no syntax errors.
• Must use real newline characters (not escaped \n).
• Must not output literal \n.
Post-install readiness requirements:
• After editable install, the environment must be immediately usable.
• A lightweight profiling sanity check and a lightweight pytest sanity check must start without immediate errors, even for projects that require execution from subdirectories.
• Test/benchmark extras and optional dependencies must be installed as needed for import and test discovery to succeed.

OUTPUT SIGNATURE

thought : string
Brief rationale describing the current decision or plan.

next_action : string
One of probe_repo, list_tree, read_file, try_import, none, or finish.

action_input : string
JSON arguments for the selected tool, or empty if no tool is called.

error_summary : string
Brief summary of the most recent build failure and its possible causes.

resolution_steps : string
Concrete steps required to resolve the failure.

docker_build_script : string
Final executable docker_build.sh script that successfully builds and installs the project from source.

Figure 10: Prompt structure for the docker build agent (B.1.5), defining its input state, tool-calling interface, constraints, and executable script output.

Example PR

CLASSIFIER INPUT

problem_description : string
Labels: performance; Description: Fixes #14471.
Body: The new ParameterExpression.bind_all is a fast path for producing a numeric result. This has advantages over ParameterExpression.bind:
• Far fewer Python objects are allocated, since no new ParameterExpression objects need to be constructed and the output is guaranteed to be numeric.
• There is no historical API requirement to scan the incoming mapping for invalid keys or values, yielding a large performance improvement when the same mapping is used to bind many expressions.
• This provides a major complexity improvement when a large values dictionary is reused many times.
There is still room for further gains because the Rust-space ParameterExpression and SymbolExpr interfaces require more heap allocations than strictly necessary, but this already yields substantial speedups.
Issues: Fixes #14471.
The linked issue reports that ParameterExpression.bind scales with the size of the binding dictionary even when only a single parameter is needed, leading to severe performance penalties for large parameter tables.
Comments:
Currently in draft because there's no tests - I'm just putting it up so Sam and Ian from #14471 can test it out for their use case. For the explicit example in that issue, a complete comparison on my machine:
Out of date timings

In [1]: from qiskit.circuit import Parameter, ParameterExpression

N: int = 100_000
parameter_values = {Parameter(f"th_{i}"): 1 for i in range(N)}
parameter_values[param := Parameter("my_param")] = 1

print("Using the specialised `Parameter` methods:")
%timeit param.bind(parameter_values, allow_unknown_parameters=True)
. . .
I think it's fine without having the same behavior. For clarity it might be helpful to add a blurb to the bind_all docstring to say that "unlike bind, NaN and inf are in the range of expected outputs for this method".
LGTM, thanks!

git_patch : string
diff --git a/crates/circuit/src/parameter/parameter_expression.rs b/crates/circuit/src/parameter/parameter_expression.rs
index 1f0406f62c7e..98da2ee3e9e6 100644
--- a/crates/circuit/src/parameter/parameter_expression.rs
+++ b/crates/circuit/src/parameter/parameter_expression.rs
@@ -1048,6 +1048,40 @@ impl PyParameterExpression {
+    #[pyo3(name = "bind_all")]
+    #[pyo3(signature = (values, *))]
+    pub fn py_bind_all(&self, values: Bound) -> PyResult {
+        let mut partial_map = HashMap::with_capacity(self.inner.name_map.len());
+        for symbol in self.inner.name_map.values() {
+            let py_parameter = symbol.clone().into_pyobject(values.py())?;
+            partial_map.insert(symbol, values.get_item(py_parameter)?.extract()?);
+        }
+        let bound = self.inner.expr.bind(&partial_map);
+        bound.eval(true).ok_or_else(|| {
+            PyTypeError::new_err(format!("binding did not produce a numeric quantity: {bound:?}"))
+        })
...

CLASSIFIER MODULE

Decide the PRIMARY performance optimization technique and the difficulty level of the optimization.
Category mapping (when performance-related): . . .
. . . algorithm rewrite or architectural change.

CLASSIFIER OUTPUT

category : OptimizationType
REMOVE_OR_REDUCE_WORK

difficulty : DifficultyLevel
'medium'

reasoning : string
The change adds a specialized 'bind_all' fast-path that skips the heavy allocation and validation logic of the generic 'bind' method, directly mapping parameters to numeric values and evaluating the expression. This reduces the amount of work performed per binding call, yielding large speedups.
The implementation touches Rust code and adds tests but is a localized, moderate-size change, fitting a medium difficulty level.

Figure 11: Example application of the optimization type and difficulty classifier (Figure 9), illustrating the predicted category, difficulty level, and reasoning for a real pull request (https://github.com/Qiskit/qiskit/pull/14782).

Judge performance related PR prompt

INPUT SIGNATURE PROMPT

problem_description : string
Problem statement and technical context from PR/issue.

git_patch : string
Git diff showing actual code changes.

file_change_summary : string
A markdown table summarizing all the files changed in the commit along with lines added/removed.

JUDGE SIGNATURE PROMPT

Decide if this commit's PRIMARY intent is to improve product/runtime performance.

Label YES only when there is CLEAR, EXPLICIT evidence in the description and/or patch that the runtime gets faster (e.g., algorithm change, fewer allocations, caching, vectorization, reduced I/O, async/non-blocking for throughput, latency reduction, memory footprint reduction, fixing a speed regression).

Strong positive signals (weigh these collectively):
- PR title/body contains performance intent (e.g., "PERF:", "speed up", "faster", "performance").
- Linked issues/comments include benchmark links or timings demonstrating impact.
- Low-level/hot-path tweaks (e.g., reuse global context, avoid per-call init/teardown, vectorize C/NumPy).

Hard NO (non-performance) examples: tests/ASV/harness-only changes; CI/workflows/build/packaging; coverage; pre-commit/format/lints (clippy/ruff/black); docs; version bumps; terminology/renames; pure refactors without performance claims; changes aimed at making perf tests pass but not improving runtime.

If ambiguous, weigh the concrete code changes and problem description together.
When there are specific performance cues (title keywords, measured timings, fewer allocations, vectorization, caching/reuse) lean YES; otherwise NO.

OUTPUT SIGNATURE PROMPT

reasoning : string
Deductive reasoning steps leading to the classification.

label : string
Final label: "YES" for performance-related, "NO" otherwise.

Figure 12: Structured DSPy prompt used to judge whether a pull request is primarily intended to improve runtime or product performance. The prompt specifies the required inputs (problem description, code diff, and file-level change summary), explicit decision criteria and exclusions for performance-related changes, and an output format consisting of a justification and a binary YES/NO label. The design emphasizes conservative, evidence-based classification, prioritizing explicit runtime improvements over incidental or refactoring-only changes.

Problem Extractor Prompt description

INPUT SIGNATURE PROMPT

pr_title : string
The GitHub PR title.

pr_body : string
The GitHub PR description.

pr_comments : string
Comments on the PR thread.

PROBLEM EXTRACTOR SIGNATURE

What problem is this GitHub PR trying to solve? Extract near-verbatim relevant text following the given JSON output. If no relevant context exists for a field, return an empty string for it.

OUTPUT SIGNATURE PROMPT

initial_observations : string | list[Any] | None
Objective symptoms of the problematic behavior, described in the present tense. Focus strictly on what is happening (metrics, user impact, frequency). Do not include causes, hypotheses, or explanations.

triage_attempts : string | list[Any] | None
The investigative steps and reasoning used to narrow down contributing factors: what you checked, what you ruled out, and what evidence you gathered to understand where the issue originates.
solution_overview : string | list[Any] | None
A concise description of the change(s) made and how they address the identified bottleneck or constraint.

solution_observations : string | list[Any] | None
What you observe after applying the change: new measurements, behavior differences, and any regressions or trade-offs that appeared.

Figure 13: Structured DSPy prompt used to extract the underlying problem and resolution context from a GitHub pull request. The prompt consumes the PR title, description, and discussion, and produces a structured summary capturing observed symptoms, triage steps, the implemented solution, and post-change observations. The design emphasizes near-verbatim extraction and separation of observations, investigation, and outcomes.

Table 7: Patch classification distribution in FORMULACODE and FORMULACODE-V. The problems in FORMULACODE-V are sampled from the best performing tasks in FORMULACODE, which is why some categories are overrepresented.

Inferred Type of Optimization Problem    % FORMULACODE   % FORMULACODE-V
Accept Less Precise Solution                    0.6584         -
Cache And Reuse                                 8.3128    4.6296
Database And Storage Tuning                     0.5761         -
Do It Earlier Batch Throttle                    2.4691    0.9259
Io And Latency Hiding                           0.0823         -
Micro Optimizations                            20.2469   23.1481
Remove Or Reduce Work                          20.0823   18.5185
Uncategorized                                   1.5638         -
Use Better Algorithm                           20.0823   26.8519
Use Better Data Structure And Layout            9.7119   12.9630
Use Higher Level System                         2.9630    2.7778
Use Lower Level System                         11.0288    9.2593
Use Parallelization                             2.2222    0.9259

Table 8: The inferred difficulty of human solutions in FORMULACODE and FORMULACODE-V.

Inferred Difficulty   % FORMULACODE   % FORMULACODE-V
Easy                        54.8971         60.1852
Medium                      44.4444         37.0370
Hard                         0.6584          2.7778

B.2.1.
AIRSPEED VELOCITY METHODOLOGY

To benchmark a new function with Airspeed Velocity, a developer supplies a setup(...) routine and one or more time profiling functions (e.g., time_foo(...), time_bar(...)) and memory profiling functions (e.g., mem_foo(...), mem_bar(...)). asv then clones the repository, creates an isolated virtual environment, and records the performance characteristics for all commits. The tool ships with best-practice safeguards (CPU affinity, warm-ups, repeated trials, etc.) to control system variance. Section 2 includes additional safeguards to further minimize system variance.

Airspeed Velocity offers many advantages towards our goal of making a benchmark for code optimization:
• Low barrier to entry. The minimalist interface means developers routinely add new benchmarks, expanding coverage over time. asv ships with robust regression-detection functionality, which further motivates developers to ensure that the asv benchmarks maximally cover all performance-critical parts of their software.
• Maturity and reliability. First released on 1 May 2015, asv encapsulates nearly a decade of community experience in timing and memory profiling code on commodity hardware. Most common pitfalls have documented solutions and well-established platform-specific best practices, ensuring results are both accurate and precise.
• CI integration. asv co-exists naturally with other continuous-integration tools, so each commit carries both performance and correctness metadata.

B.2.2. MODEL AND AGENT CHOICES

Models. Our experimental design centers on four models – GPT-5, Claude 4.0 Sonnet, Gemini 2.5 Pro, and Qwen 3 Coder – that represent the strongest generally available systems for coding and tool-use workloads at the time of paper writing.
We selected these models because they are natively integrated with our inference provider and support long context windows, function calling, and multi-turn interactions at a cost profile compatible with large-scale benchmarking. We treat these models as representative of the frontier capability regime against which different agent architectures can be fairly compared.

1. GPT-5. GPT-5 (Singh et al., 2025) is OpenAI's flagship general-purpose model in this study, and we use the standard API configuration with built-in "thinking" enabled. It is a multimodal, tool-using model with strong performance on code, math, and long-context reasoning benchmarks, and is widely deployed in agentic coding systems. We use the gpt-5-2025-08-07 version specifically, with a documented knowledge cutoff of late September 2024.

2. Claude 4.0 Sonnet. Claude 4.0 Sonnet (Anthropic, 2025) is Anthropic's top-end general-purpose model at the time of our experiments, designed for complex reasoning, long-form generation, and tool-heavy workloads such as software development. Public reports place Claude 4.0 Sonnet at or near the frontier on a wide range of coding and reasoning benchmarks. We use the claude-sonnet-4-20250514 version specifically, with a documented knowledge cutoff date of January 2025 and training data extending to March 2025.

Figure 14: Overview of the pipeline for Docker environment synthesis. The system reuses chronologically adjacent build scripts when possible, otherwise invoking an LLM agent that generates and refines Docker scripts using build logs and repository context until a verifier confirms a successful, reproducible build.

3. Gemini 2.5 Pro. Gemini 2.5 Pro (Comanici et al.
, 2025) is Google DeepMind's latest high-end model at the time of writing, introduced as the first member of the Gemini 2.5 series and optimized for complex multimodal reasoning. It offers a very large context window (up to 1M tokens in the preview configuration) and supports advanced tool-calling and code execution. It has a documented knowledge cutoff date of January 2025. We include Gemini 2.5 Pro to ensure that our agentic analysis covers three distinct provider ecosystems under comparable frontier-model conditions.

4. Qwen 3 Coder. Qwen 3 Coder is a large open Mixture-of-Experts model explicitly optimized for agentic coding tasks rather than general conversation. Qwen 3 Coder (in particular, the qwen3-coder-480b-a35b-instruct model) combines 480B total parameters with sparse expert activation (35B active parameters per forward pass) and a context window of roughly 262k tokens, enabling it to reason over entire repositories and multi-file refactors in a single pass. Third-party model cards list a knowledge cutoff of 23 January 2025 (LangDB, 2025). Empirically, Qwen 3 Coder claims strong results on SWE-Bench and related agentic coding and browser-use benchmarks (Yang et al., 2025).

Agents. We evaluate two agent frameworks within FORMULACODE: Terminus 2, the default harness for Terminal-Bench, and an agent implemented with OpenHands, a popular open-source framework for AI-driven software development. We intentionally omit more complex agent families such as tree-structured search agents and evolutionary or population-based methods. Tree agents that branch over alternative command sequences must maintain multiple snapshots of the terminal state, which quickly leads to exponential blowup in cloud compute usage.
Evolutionary agents that track a Pareto frontier across many workloads are similarly expensive: given that the median FORMULACODE task exposes roughly 81 workloads, the number of candidate solutions required to reasonably explore the frontier is beyond our evaluation budget.

1. Terminus 2. Terminus 2 is a reference agent for Terminal-Bench (Merrill et al., 2026). It is intentionally minimal: the agent spawns a single tmux session and exposes the raw shell to the model, which issues commands as plain text and receives the terminal output verbatim, without additional structured tools or high-level abstractions. This architecture can be viewed as a reflexive, single-trajectory agent that repeatedly observes the current terminal state, updates its internal plan implicitly in the model's hidden state, and emits the next command. Despite its simplicity, Terminus 2 is competitive with more elaborate systems, making it a natural baseline for FORMULACODE.

2. OpenHands. OpenHands is a widely used open-source framework for AI-driven software development (Wang et al.,

Figure 15: Timeline of FORMULACODE tasks organized by the date the expert patch was merged, through November 2025. Each box represents the number of expert-patch tasks merged during a particular month/year. FORMULACODE is updated on the 31st of each month, and our most recent task is from 2025-11-21. The dataset grows by 20.25 tasks per month on average, facilitating contamination analyses for performance-optimization agents. Table 16 presents a detailed overview.

Table 9: Optimization categories used to categorize human solutions in FORMULACODE.
The taxonomy is derived from various online sources, listed in the primary references for each category.

Abbreviation   Category Description                                        Source
Algo           Use a better algorithm                                      (Tratt, 2023)
Data           Use a better data structure (and layout)                    (Tratt, 2023)
Lower          Use a lower-level system                                    (Tratt, 2023)
Approx         Accept a less-precise solution (approximation/heuristics)   (Tratt, 2023)
Parallel       Use parallelization                                         (Tratt, 2025)
Reduce         Remove or reduce work (requirements & UX)                   (Forum Discussion, 2025; 2023)
Cache          Cache & reuse                                               (Forum Discussion, 2025)
Batch          Do it earlier / batch it / throttle it                      (Forum Discussion, 2025)
Scale          Scale the platform                                          (Forum Discussion, 2025)
DB             Database & storage tuning                                   (Forum Discussion, 2025)
Micro          Micro-optimizations (hot path tweaks)                       (Forum Discussion, 2025)
I/O            I/O and latency hiding (async, overlap I/O/compute)         (Forum Discussion, 2025; 2023)
Higher         Use a higher-level system that optimizes for you            (Forum Discussion, 2025)
Uncat          Uncategorized                                               –

2025). OpenHands exposes a flexible SDK that allows defining agents as compositions of tools and routines that can clone repositories, edit files, run tests, and manage long-running coding sessions, with support for swapping out the underlying LLM. In our experiments, we utilize a single-trajectory terminal-plus-editor agent implemented in the OpenHands SDK, following a default configuration used in Terminal-Bench (Merrill et al., 2026).

B.2.3. KINDS OF OPTIMIZATION PROBLEMS

We categorize human-written solutions in FORMULACODE into thirteen optimization classes gathered from various online sources. We reviewed these sources, normalized overlapping suggestions into standard terminology, and used them to define the categories, which are then applied consistently in our analysis.
This taxonomy is intentionally non-exhaustive: it serves as a practical baseline for analysis, capturing the principal codebase optimizations that developers typically consider when improving performance, rather than offering an authoritative catalog of all systems optimizations.

B.2.4. QUALITATIVE EXAMPLES

Qualitative examples are presented in Figure 25, Figure 26, and Figure 27.

Table 10: Cost-aware leaderboard of agent–model configurations. We report cost per task, mean advantage Adv_agent, cost-weighted advantage Adv_cost_agent, and cost-weighted normalized advantage Ãdv_cost_agent.

Agent       Model               Cost/Task ↓   Adv_agent ↑   Adv_cost_agent ↑   Ãdv_cost_agent ↑
Terminus 2  GPT-5                    1.8508       -0.0504            -0.0272            -0.0750
            Claude 4.0 Sonnet        3.7722       -0.0410            -0.0109            -0.0282
            Gemini 2.5 Pro           1.5455       -0.0433            -0.0280            -0.0737
            Qwen 3 Coder             1.2060       -0.0454            -0.0376            -0.1043
OpenHands   GPT-5                    0.7814       -0.0209            -0.0267            -0.0899
            Claude 4.0 Sonnet        3.2300       -0.0112            -0.0035            -0.0150
            Qwen 3 Coder             1.0974       -0.0301            -0.0274            -0.1393

B.2.5. TERMINAL BENCH MODIFICATIONS

Terminal-Bench (Merrill et al., 2026) is a widely used harness for benchmarking terminal-based software development tasks. It is actively maintained, well understood by the agent development and benchmarking community, and already designed around end-to-end agent execution in a containerized shell environment. However, Terminal-Bench primarily targets correctness-oriented evaluations. In FORMULACODE, the evaluation target shifts: tasks are optimization-centric and require measuring performance improvements reliably, comparing multiple agent/model configurations under matched conditions, and auditing performance-oriented behavior and cost. We therefore extend Terminal-Bench along four capability axes.

Standardized execution for low-variance measurement.
To complement the variance-control safeguards in Section 2, we add support for executing runs in standardized isolated environments (e.g., fixed cloud machines). This reduces machine-to-machine drift and makes speedup measurements more comparable across runs, which is essential when the benchmark signal is a relative performance change rather than a binary pass/fail outcome. Operationally, we extend Terminal-Bench to support running tasks on compute-optimized Amazon Web Services (AWS) EC2 instances. Such instances are guaranteed to have a finite amount of isolated hardware resources situated in professionally-managed data centers, ensuring third-party reproducibility of FORMULACODE's experiments (Amazon Web Services). We use the c5ad.large instance with 2 vCPUs, 4 GiB RAM, and a dedicated 75 GiB SSD for storage. This instance is chosen specifically because it is extremely cost-efficient (on-demand price of $0.086 per hour at the time of writing). Importantly, remote execution is a reproducibility convenience rather than a methodological prerequisite. The ASV-based protocol (warm-ups, repeated trials, and the variance controls in Section 2) is designed to yield reliable estimates on well-managed local commodity machines. We use EC2 primarily to eliminate avoidable confounds – resource contention, background load, and hardware heterogeneity – to provide a clean gold-standard reference for subsequent experiments.

Sequential agent evaluation. We add controls to evaluate multiple agent/model configurations sequentially within the same standardized environment. For each FORMULACODE task, we provision a single instance and evaluate agent/model configurations in separate fresh containers: we measure the baseline implementation (Code_0), then the human-written optimized solution (Code_expert), and then each agent-produced candidate in turn, resetting the container state between configurations.
This design ensures that comparisons are statistically matched by construction (same hardware and near-identical runtime conditions) while preventing cross-run interference from accumulated state.

Optimization-centric metrics. Terminal-Bench natively aggregates discrete outcomes (e.g., test pass/fail). We extend the measurement and analysis layers to parse and summarize continuous optimization signals (e.g., speedup, advantage, and variance) and to support custom aggregation procedures (e.g., stratification by difficulty, as described in Figure 23).

Additional accounting metrics. Finally, we add explicit support for token-usage and API-cost accounting, as well as other observability metrics (improved logging, robust timeout handling, and comprehensive interactive traces). These additions enable the cost-aware and failure-mode analysis reported in Section 3. Overall, these modifications enable the use of Terminal-Bench as a stable evaluation harness for FORMULACODE.

Table 11: Correctness constraint violations by agent–model configuration. For each configuration, we report the total number of rejected solutions (out of 108), along with how many are attributable to PyTest failures versus snapshot test failures.

Agent       Model               Total ↓   PyTest ↓   Snapshot ↓
Terminus 2  GPT-5                    54         51           32
            Claude 4.0 Sonnet        55         52           36
            Gemini 2.5 Pro           55         53           30
            Qwen 3 Coder             56         54           29
OpenHands   GPT-5                    47         42           30
            Claude 4.0 Sonnet        50         43           34
            Qwen 3 Coder             50         44           32

B.3. Additional Analysis

This section lists additional analysis on FORMULACODE-V that was not included in the main paper for space reasons. We analyze (1) the rate of correctness constraint violations across agent/model configurations, (2) the relationship between trajectory length and performance, (3) patterns of tool usage across configurations, and (4) qualitative examples of agent patches.

Correctness Constraint Violations.
Each FORMULACODE-V task is associated with two types of correctness constraints: (1) snapshot tests, which verify that the optimized codebase preserves each workload's local variables, and (2) the original PyTest suite from the upstream repository, which captures broader functional correctness. At initialization, the agent–model configuration receives explicit instructions to maximize performance while preserving correctness. If the patch fails either constraint, we 'roll back' any performance improvements and revert to the original codebase, ensuring that all reported speedups are strictly correctness-preserving. We therefore ask: how often are candidate performance-improving edits rejected solely due to correctness violations? For each agent–model pair, we count the number of tasks in which the final patch fails at least one test, and then further break this down into PyTest failures and snapshot test failures. Table 11 summarizes these statistics over 108 attempted solutions per configuration.

Observation: Correctness violations are common and represent a major source of rollbacks. We find that models spend most of their budget exploring patches that ultimately fail correctness checks. On average, 52.43% of trajectories are rejected due to correctness violations, with the majority of these failures stemming from PyTest suite violations rather than snapshot test failures. We believe this to be a consequence of the multi-objective nature of the optimization problem. A single-objective setting allows verifying new functionalities with a single tool call. However, in a multi-objective setting, the agent–model configuration must strategically allocate interactions towards running either the benchmarking tool, the snapshot verification tool, or the pytest suite depending on the new functionality it introduces.
The tool call distribution in Table 14 supports this hypothesis, as most agents demonstrate an inclination towards running performance validation commands rather than correctness validation commands.

Trajectory Length and Performance. Discovering effective performance optimizations requires a deep understanding of the codebase. Agents must interact with the codebase through a terminal interface to obtain such an understanding. In this experiment, we study the relation between the number of interactions and the global performance achieved by the cumulative trajectory of interactions. For each task, we record the number of complete command-line agent interactions (interactions where the agent runs a command and receives a response from the environment) and calculate the mean and median trajectory lengths averaged over all tasks. We then calculate the length-weighted advantage as len(Adv_agent) = Adv_agent / len_agent. Table 12 showcases these results.

Observation: Trajectory lengths can be highly skewed. Some configurations demonstrate highly skewed trajectories. Specifically, Terminus 2 + GPT-5 and Terminus 2 + Gemini 2.5 Pro have mean lengths substantially larger than the median length, suggesting that these configurations occasionally require very long interactive runs. By contrast, OpenHands + Claude 4.0 Sonnet has more stable trajectory lengths across tasks, as the deviation between the mean and median is much smaller.

Observation: Agent choice has a substantial effect on overall behavior. The same model behaves very differently depending

Table 12: Trajectory length and length-weighted advantage. For each agent–model configuration, we report the mean and median trajectory length (in interaction steps), as well as a length-weighted advantage (len(Adv_agent)).
Agent       Model               Mean Length ↓   Median Length ↓   len(Adv_agent) ↑
Terminus 2  GPT-5                      295.53            198.50          -0.000226
            Claude 4.0 Sonnet           73.13             63.50          -0.000349
            Gemini 2.5 Pro             106.99             63.50          -0.000755
            Qwen 3 Coder                99.91             90.50          -0.000557
OpenHands   GPT-5                       68.60             61.00          -0.000299
            Claude 4.0 Sonnet          222.80            219.50          -0.000106
            Qwen 3 Coder               633.10            595.00          -0.000044

Table 13: Tool categories used in trajectory classification. The classifier's implementation mirrors that of the optimization category classifier (§B.2.3).

Category        Description
editing         Text editing or transformation commands (e.g., sed, awk, ed).
search          Search/discovery commands for finding files or text (e.g., grep, rg, find, fd).
view            Read-only inspection commands for showing file/output snippets (e.g., cat, less, head, tail).
fs_ops          Filesystem mutation/metadata operations (e.g., cp, mv, rm, mkdir, chmod).
shell_session   Shell navigation/session management commands (e.g., cd, ls, pwd, clear, exit).
git             Version-control commands and git-derived shell variable setup (e.g., git, diff, reset).
python_exec     Python execution plus Python environment/package commands (e.g., python, pip, micromamba).
test            Test-running commands, including snapshot checks (e.g., pytest, snapshot-tool).
bench           Benchmark/profiling commands, primarily ASV workflows (e.g., asv run, asv profile).
patching        Patch/diff application commands or diff-marker lines (e.g., patch, applypatch, ---/+++).
other           Commands/fragments that do not match the above classes, including control-flow snippets or terminal noise.

on the chosen agent. For example, GPT-5 produces much longer trajectories in Terminus 2 than in OpenHands, while Claude 4.0 Sonnet and Qwen 3 Coder show the opposite pattern. This suggests that surrounding agent design heavily shapes search behavior.

Tool-Usage Patterns.
In all FORMULACODE tasks, agents are given unrestricted access to the bash command line with additional performance profiling and correctness testing tools. In this experiment, we analyze how different configurations employ tools during optimization. For each task, we store the command-line interactions of the agent–model configurations and use an LLM to categorize the input commands based on the primary purpose of the command. The implementation is identical to the performance categorization classifier (§B.1.7). We then aggregate the tool type classifications by total tool uses and tool uses per category. Table 14 summarizes these statistics across all configurations.

Observation: Agents invoke benchmarking tools more than testing tools. All agent–model configurations show a strong preference for running benchmarking and profiling commands over correctness validation commands, with an average

Table 14: Tool-usage statistics by agent–model configuration. Columns report the total number of tool calls and the percentage distribution of calls across tool categories (judged by openai/gpt-oss-120b using the categories in Table 13). The most effective configurations spend the majority of their tool calls on file operations (editing, search, and view) and running performance benchmarks (bench), with the remaining calls distributed across a variety of tool categories.
Agent       Model              Total   editing  search  view   fs_ops  shell  git    python  test  bench  patching  other
Terminus 2  GPT-5              13370   19.40    15.70   10.29  2.76    5.00   12.14  11.20   2.89  17.51  2.07      1.02
            Claude 4.0 Sonnet  4214    6.12     8.00    11.25  7.78    8.19   0.00   16.61   2.18  6.36   0.00      33.51
            Gemini 2.5 Pro     5641    11.35    6.29    5.96   7.43    11.45  2.80   5.48    0.62  16.04  0.27      32.32
            Qwen 3 Coder       3565    17.59    16.89   8.61   12.17   12.45  0.45   8.72    1.49  9.99   0.36      11.28
OpenHands   GPT-5              4683    14.35    19.65   26.44  0.62    1.62   4.36   7.94    3.93  18.26  0.04      2.80
            Claude 4.0 Sonnet  6323    12.92    20.94   26.51  0.76    1.57   3.56   9.90    3.67  16.86  0.03      3.29
            Qwen 3 Coder       8638    10.66    20.39   25.24  0.79    1.62   4.33   9.92    3.96  19.23  0.02      3.84

of 14.90% of tool calls dedicated to benchmarking/profiling and only 2.68% of calls dedicated to testing. This proclivity towards performance validation over correctness validation might have a substantial impact on our previous observation that correctness violations are prevalent for all agent–model configurations.

Observation: Reading is the dominant tool category. The most frequently used tool category across all configurations is file-system operations (editing, searching, and viewing files), which accounts for an average of 31.74% of all tool calls. This is consistent with the intuition that developing a holistic understanding of the codebase is a prerequisite for synthesizing effective optimizations.

B.4. Qualitative Examples: Human Expert vs. AI Agent Patches

This section presents side-by-side comparisons of human expert and AI agent patches for FORMULACODE tasks. Specifically, the following examples are showcased:

• Figure 16 (modin_modin_2). Failure mode: incorrect triage; expert gained edge by identifying the performance hotpath. Modin has expensive auto-switch backend logic that was being called even when all inputs shared the same backend. The agent was not able to identify the core issue, instead focusing on a caching issue that was not on the performance-critical path.
The human correctly identified the issue and implemented a fix to the casting logic.

• Figure 17 (optuna_optuna_6). Failure mode: correct triage; expert gained edge through numpy vectorization. Optuna's hypervolume computation used a naive recursive algorithm when a faster O(N^2) approach was possible. Both the human and the agent were able to identify and implement the faster algorithm. However, the human's solution used fully vectorized numpy operations, while the agent's solution used a Python-level sweep-line approach with bisect. This resulted in the human outperforming the agent despite both having the same asymptotic complexity.

• Figure 18 (optuna_optuna_1). Failure mode: correct triage; expert implemented a holistic full-module optimization. Optuna's implementation for sorting non-dominated Pareto fronts used a naive algorithm that did not scale well as the number of trials increased. Both the human and the agent identified this issue; the agent's implementation utilized a Fenwick-tree-based algorithm that fixed a single hotpath (when inputs are 2D). However, the expert implemented a holistic rewrite: it optimized the entire call chain to use vectorized numpy operations and merged separate pathways for 2D/N-D optimization, resulting in complementary improvements across the entire multi-objective optimization flow.

• Figure 19 (networkx_networkx_4). The core issue was that NetworkX's BFS-based component discovery algorithm did not implement an early-termination optimization. Both the human and the agent fix this by implementing early termination. However, the agent outperforms the human by further optimizing the BFS implementation, achieving an additional +0.0132 advantage on top of the human's improvement.

• Figure 20 (pybamm_team_pybamm_1).
A sensitivity computation in PyBaMM created a quadratic memory allocation bottleneck due to incremental concatenation, without exploiting that the full size was known in advance. Both the human and the agent identify the issue, collecting all blocks first and concatenating once. The agent further optimizes the concatenation logic by consolidating multiple function calls into one and adding guards for empty inputs, resulting in a +0.0167 advantage.

• Figure 21 (shapely_shapely_1). The deprecate_positional decorator in Shapely called inspect.signature on every invocation, causing 300–1000% slowdowns. The human and agent solutions both converged on the same strategy: implementing a caching layer in the decorator. However, the agent implemented additional optimizations to skip the hot path when no deprecated parameters existed, resulting in a +0.0132 advantage.

Human Expert Patch

diff --git a/modin/core/storage_formats/pandas/query_compiler_caster.py
--- a/modin/core/storage_formats/pandas/query_compiler_caster.py
+++ b/modin/core/storage_formats/pandas/query_compiler_caster.py
@@ -1006,6 +1006,8 @@
     pin_target_backend = None
+    input_backends: set[str] = set()

     def register_query_compilers(arg):
+        input_backends.add(arg_backend)
@@ -1047,10 +1050,12 @@
-    inputs_pinned = (
-        len(input_query_compilers) < 2 and pin_target_backend is not None
-    )
-    if not AutoSwitchBackend.get() or inputs_pinned:
+    # Skip casting when all inputs share a single backend and either
+    # auto-switching is disabled or inputs are pinned
+    if len(input_backends) == 1 and (
+        not AutoSwitchBackend.get() or pin_target_backend is not None
+    ):
         f_to_apply = _get_extension_for_method(...)
        result = f_to_apply(*args, **kwargs)
-    if isinstance(result, QueryCompilerCaster) and inputs_pinned:
+    if (isinstance(result, QueryCompilerCaster)
+            and pin_target_backend is not None):
         result._set_backend_pinned(True, inplace=True)
     return result

AI Agent Patch

diff --git a/modin/core/storage_formats/pandas/query_compiler_caster.py
--- a/modin/core/storage_formats/pandas/query_compiler_caster.py
+++ b/modin/core/storage_formats/pandas/query_compiler_caster.py
@@ -137,23 +137,52 @@
-def _get_empty_qc_for_default_backend() -> BaseQueryCompiler:
-    """Get an empty query compiler for the default backend."""
+def _get_empty_qc_for_backend(backend: str) -> BaseQueryCompiler:
+    """Get an empty query compiler for the specified backend."""
     from modin.core.execution.dispatching.factories.dispatcher import FactoryDispatcher
-    return FactoryDispatcher.get_factory().io_cls.from_pandas(pandas.DataFrame())
+    original_backend = Backend.get()
+    try:
+        Backend.put(backend)
+        return FactoryDispatcher.get_factory().io_cls.from_pandas(pandas.DataFrame())
+    finally:
+        Backend.put(original_backend)

-_BACKEND_TO_EMPTY_QC: defaultdict[str, BaseQueryCompiler] = defaultdict(
-    _get_empty_qc_for_default_backend
-)
+_BACKEND_TO_EMPTY_QC: dict[str, BaseQueryCompiler] = {}

+def _get_cached_empty_qc(backend: str) -> BaseQueryCompiler:
+    if backend not in _BACKEND_TO_EMPTY_QC:
+        _BACKEND_TO_EMPTY_QC[backend] = _get_empty_qc_for_backend(backend)
+    return _BACKEND_TO_EMPTY_QC[backend]
@@ -1042,7 +1071,7 @@
-    input_qc_for_pre_op_switch = _BACKEND_TO_EMPTY_QC[input_backend]
+    input_qc_for_pre_op_switch = _get_cached_empty_qc(input_backend)

Summary

Disables AutoSwitchBackend by default and rewrites the casting-skip logic in query_compiler_caster.py to track the set of distinct input backends, skipping expensive query-compiler conversions when all inputs share a single backend. Updates four test files and adjusts metric assertions (not shown).
Summary

Fixes a bug where the defaultdict factory ignores the requested backend when creating empty query compilers, replacing it with an explicit _get_cached_empty_qc function that temporarily switches Backend.put() to the correct backend. A correctness fix, but not on the performance-critical path.

Figure 16: modin_project-modin_2: Modin's AutoSwitchBackend feature, enabled by default, triggered an expensive type conversion even when all inputs shared the same backend. The agent solution (openhands:claude-sonnet-4) identified and fixed a real bug in the caching logic, but this was not on the performance-critical path, resulting in a −0.1265 advantage compared to the human expert's systemic fix that disabled AutoSwitchBackend by default and optimized the casting logic to track input backend diversity, skipping conversions when unnecessary.

Human Expert Patch

diff --git a/optuna/_hypervolume/wfg.py b/optuna/_hypervolume/wfg.py
--- a/optuna/_hypervolume/wfg.py
+++ b/optuna/_hypervolume/wfg.py
# New O(N^2) vectorized 3D hypervolume via coordinate compression
+def _compress_coordinate(coords: np.ndarray) -> tuple[np.ndarray, np.
ndarray]:
+    sorted_indices = np.argsort(coords)
+    values = coords[sorted_indices]
+    r = np.zeros_like(sorted_indices)
+    r[sorted_indices] = np.arange(coords.shape[0], dtype=r.dtype)
+    return r, values

+def _compute_3d(sorted_pareto_sols: np.ndarray, reference_point: np.ndarray) -> float:
+    """O(N^2) 3D hypervolume via cumulative minimum on compressed coordinates."""
+    n = sorted_pareto_sols.shape[0]
+    x_vals = np.concatenate([sorted_pareto_sols[:, 0], reference_point[:1]])
+    y_ind, y_vals = _compress_coordinate(sorted_pareto_sols[:, 1])
+    y_vals = np.concatenate([y_vals, reference_point[1:2]])
+    z_max = reference_point[2]
+    zs = np.full((n, n), z_max)
+    for i in range(n):
+        zs[i, y_ind[i]] = sorted_pareto_sols[i, 2]
+    zs = np.minimum.accumulate(zs, axis=0)
+    zs = np.minimum.accumulate(zs, axis=1)
+    x_delta = x_vals[1:] - x_vals[:-1]
+    y_delta = y_vals[1:] - y_vals[:-1]
+    return np.sum((z_max - zs) * x_delta[:, np.newaxis] * y_delta[np.newaxis, :])
@@ -125,6 +178,13 @@ def compute_hypervolume(
     if reference_point.shape[0] == 2:
         hv = _compute_2d(sorted_pareto_sols, reference_point)
+    elif reference_point.shape[0] == 3:
+        hv = _compute_3d(sorted_pareto_sols, reference_point)
     else:
         hv = _compute_hv(sorted_pareto_sols, reference_point)

AI Agent Patch

diff --git a/optuna/_hypervolume/wfg.py b/optuna/_hypervolume/wfg.py
--- a/optuna/_hypervolume/wfg.py
+++ b/optuna/_hypervolume/wfg.py
# x-sweep with incremental 2D Pareto frontier via bisect
+def _compute_3d(sorted_pareto_sols, reference_point) -> float:
+    """Exact 3D hypervolume using x-sweep with incremental 2D frontier."""
+    import bisect
+    xs = pts[:, 0]
+    dx = np.maximum(xs_ext[1:] - xs_ext[:-1], 0.0)
+    y_list: list[float] = []
+    z_list: list[float] = []
+
+    def insert_frontier(y: float, z: float) -> None:
+        i = bisect.bisect_left(y_list, float(y))
+        if i > 0 and z >= z_list[i - 1]:
+            return  # dominated by left neighbor
        # ...
        # (dominance-aware insertion: handle equal y,
        #  remove dominated points to the right)
+        y_list.insert(i, float(y))
+        z_list.insert(i, float(z))
+
+    for i in range(n):
+        insert_frontier(float(pts[i, 1]), float(pts[i, 2]))
+        if y_list:
+            yz = np.column_stack((np.asarray(y_list), np.asarray(z_list)))
+            areas[i] = _compute_2d(yz, ref_yz)
+    return float(np.dot(dx, areas))
@@ -126,7 +190,7 @@ def compute_hypervolume(
-        hv = _compute_hv(sorted_pareto_sols, reference_point)
+        hv = _compute_3d(...) if sorted_pareto_sols.shape[1] == 3 else _compute_hv(...)

Summary

Adds a specialized O(N^2) _compute_3d function using a _compress_coordinate helper that maps y-coordinates to integer ranks via np.argsort, builds an N × N grid, and applies np.minimum.accumulate along both axes to compute the dominated volume in fully vectorized numpy. Also adds a dedicated elif branch in compute_hypervolume and parameterized tests (not shown).

Summary

Adds a _compute_3d function using an x-sweep with incremental 2D Pareto frontier maintenance via bisect and Python lists. At each x-slice, the frontier is updated with dominance-aware insertion, then the 2D area is computed by delegating to _compute_2d. The dispatch in compute_hypervolume is modified with an inline ternary for 3D inputs.

Figure 17: optuna_optuna_6: Optuna's _hypervolume.WFG class used a naive recursive algorithm for hypervolume computation with O(N^3) runtime in the common 3D case, when an O(N^2) approach was possible. Both the human and the agent identified and implemented the faster algorithm. However, the human's solution used fully vectorized numpy operations, while the best agent (terminus-2:gpt-5) used a Python-level sweep-line approach with bisect. This resulted in the human outperforming the agent, with a −0.03964 agent advantage, despite both having the same asymptotic complexity.
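Both patches above ultimately reduce the 3D case to the cheap 2D base case. To illustrate why 2D hypervolume vectorizes so well, the following sketch computes the 2D hypervolume of a Pareto front (minimization) by sorting along the first objective and summing disjoint rectangular slabs. This is a minimal illustrative sketch with hypothetical names, not Optuna's actual _compute_2d.

```python
import numpy as np

def hv_2d(pareto_sols: np.ndarray, ref: np.ndarray) -> float:
    """2D hypervolume (minimization) of mutually non-dominated points.

    After sorting by the first objective, the second objective is strictly
    decreasing along the front, so the dominated region decomposes into
    disjoint slabs: one width/height product per front point.
    """
    pts = pareto_sols[np.argsort(pareto_sols[:, 0])]
    xs = np.concatenate([pts[:, 0], ref[:1]])  # slab edges along objective 1
    widths = xs[1:] - xs[:-1]                  # width of each slab
    heights = ref[1] - pts[:, 1]               # dominated height per slab
    return float(np.sum(widths * heights))
```

For the front points (1, 3) and (2, 1) with reference point (4, 5), this yields 1·2 + 2·4 = 10, the area of the union of the two dominated rectangles.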
Human Expert Patch

diff --git a/optuna/study/_multi_objective.py b/optuna/study/_multi_objective.py
--- a/optuna/study/_multi_objective.py
+++ b/optuna/study/_multi_objective.py
@@ (selected excerpts)
-def _get_pareto_front_trials_2d(...):
-    ...  # Separate 2D implementation
-def _get_pareto_front_trials_nd(...):
-    ...  # Separate N-D implementation
-def _get_pareto_front_trials_by_trials(...):
-    if len(directions) == 2:
-        return _get_pareto_front_trials_2d(...)
-    return _get_pareto_front_trials_nd(...)
+def _get_pareto_front_trials_by_trials(...):
+    loss_values = np.asarray(...)
+    on_front = _is_pareto_front(loss_values,
+                                assume_unique_lexsorted=False)
+    return [t for t, p in zip(trials, on_front) if p]

-def _fast_non_dominated_sort(
-    objective_values, *, penalty=None, n_below=None
+def _fast_non_domination_rank(
+    loss_values, *, penalty=None, n_below=None
 ) -> np.ndarray:
-    ...  # O(n^2) broadcast + defaultdict
+    ...  # Vectorized _calculate_nondomination_rank
+    ...  # + _is_pareto_front with lexsort

AI Agent Patch

diff --git a/optuna/study/_multi_objective.py b/optuna/study/_multi_objective.py
--- a/optuna/study/_multi_objective.py
+++ b/optuna/study/_multi_objective.py
@@ -189,42 +189,106 @@ def _calculate_nondomination_rank(...):
     ...
     # Fast path for 2D objectives.
+    if objective_values.shape[1] == 2:
+        x = objective_values[:, 0]
+        y = objective_values[:, 1]
+        order = np.lexsort((y, x))
+        ys_unique = np.unique(y)
+        y_idx_all = np.searchsorted(ys_unique, y, side='right')
+        m = len(ys_unique)
+        bit = np.zeros(m + 1, dtype=int)
+        def bit_query(i):  # Fenwick tree prefix max
+            ...
+        def bit_update(i, v):
+            ...
+        # Process equal-x groups, BIT for rank
+        ...
+        return ranks, last_rank
+    # Fallback: original O(n^2) broadcast for >=3D.
     domination_mat = np.all(...) & np.any(...)

Summary

Complete rewrite of _multi_objective.py.
Renames _fast_non_dominated_sort to _fast_non_domination_rank, replaces the O(n^2) broadcast-based algorithm with a vectorized _is_pareto_front and _calculate_nondomination_rank implementation, merges the separate 2D/N-D Pareto front functions, and updates all callers across the TPE sampler and NSGA-II selection strategy.

Summary

Adds a specialized O(n log n) BIT (Fenwick tree) algorithm for 2D objectives in _calculate_nondomination_rank, falling back to the original O(n^2) broadcast for ≥3 objectives. While algorithmically superior for the 2D case, the agent only optimizes the inner ranking function without restructuring callers or the Pareto front computation.

Figure 18: optuna_optuna_1: The original implementation of Optuna's non-dominated sorting in multi-objective optimization emerged as a performance bottleneck when scaling to a large number of trials (∼10000 trials). Both the best agent (terminus-2:gpt-5) and the human expert correctly identified the issue. The agent's solution focused on optimizing the inner ranking function with a specialized O(n log n) Fenwick-tree algorithm for 2D objectives, while retaining the original O(n^2) broadcast-based approach for higher dimensions. In contrast, the human expert implemented a holistic rewrite of the entire call chain to use vectorized numpy operations and merged separate pathways for 2D/N-D optimization. This broader vectorized rewrite captured multi-objective speedups (5–6×) that the agent's localized change missed, resulting in a −0.03629 agent advantage.
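The O(n^2) broadcast test that the agent's patch keeps as a fallback (and that the expert's lexsort-based _is_pareto_front replaces) can be sketched as follows. This is an illustrative reimplementation, not Optuna's code.

```python
import numpy as np

def is_pareto_front(loss_values: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows, assuming minimization.

    Row j dominates row i if it is <= in every objective and < in at
    least one. The (n, n, d) broadcast makes this O(n^2 * d) in time
    and memory, which is what motivates the vectorized rewrite.
    """
    a = loss_values[:, None, :]  # shape (n, 1, d): candidate rows i
    b = loss_values[None, :, :]  # shape (1, n, d): potential dominators j
    dominated = np.any(np.all(b <= a, axis=2) & np.any(b < a, axis=2), axis=1)
    return ~dominated
```

Duplicate rows weakly dominate each other but never strictly, so both copies stay on the front, matching the usual non-domination convention.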
Human Expert Patch

diff --git a/networkx/algorithms/components/connected.py
@@ -65,7 +65,7 @@ def connected_components(G):
     for v in G:
         if v not in seen:
-            c = _plain_bfs(G, n, v)
+            c = _plain_bfs(G, n - len(seen), v)
             seen.update(c)
diff --git a/networkx/algorithms/components/weakly_connected.py
@@ -62,7 +62,7 @@ def weakly_connected_components(G):
     for v in G:
         if v not in seen:
-            c = set(_plain_bfs(G, n, v))
+            c = set(_plain_bfs(G, n - len(seen), v))
             seen.update(c)

AI Agent Patch

diff --git a/networkx/algorithms/components/connected.py b/networkx/algorithms/components/connected.py
index ebe0d8c1..922d4904 100644
--- a/networkx/algorithms/components/connected.py
+++ b/networkx/algorithms/components/connected.py
@@ -62,12 +62,15 @@ def connected_components(G):
-    [-n-]{+remaining+} = len(G)
     ...
-    c = _plain_bfs(G, [-n-]{+remaining+}, v)
     seen.update(c)
+    remaining -= len(c)
     yield c
+    if remaining == 0:
+        break
@@ -203,14 +206,16 @@ def _plain_bfs(G, n, source):
+    seen_add = seen.add
     while nextlevel:
         thislevel = nextlevel
         nextlevel = []
+        append = nextlevel.append
         ...
         if w not in seen:
-            [-seen.add(w)-]{+seen_add(w)+}
-            [-nextlevel.append(w)-]{+append(w)+}
diff --git a/networkx/algorithms/components/weakly_connected.py b/networkx/algorithms/components/weakly_connected.py
index ecfac50a..a89b7af8 100644
--- a/networkx/algorithms/components/weakly_connected.py
+++ b/networkx/algorithms/components/weakly_connected.py
@@ -59,12 +59,15 @@ def weakly_connected_components(G):
# (same early-exit optimization as connected_components above)
@@ -166,32 +169,30 @@ def _plain_bfs(G, n, source):
# (same local-variable caching as connected._plain_bfs above)
# additionally, converted from generator (yield) to returning seen set:
-    yield source
+    ...
     if len(seen) == n:
-        return
+        return seen
+    return seen

Summary

Minimal single-line fix in both connected_components and weakly_connected_components: passes n - len(seen) instead of n to _plain_bfs, tightening the BFS early-termination bound so it stops as soon as all remaining unseen nodes are found. No structural changes to the BFS itself.

Summary

Multi-pronged optimization: tracks a remaining node count to break out of the component loop early, caches method lookups (seen.add, nextlevel.append) into local variables, and converts the weakly-connected _plain_bfs from a generator to a batch set return, eliminating per-node yield overhead.

Figure 19: networkx_networkx_4: NetworkX's connected_components and weakly_connected_components passed the total graph node count n to _plain_bfs without accounting for already-discovered nodes, missing an early-termination optimization. For disconnected graphs with large components explored last, this caused dramatic slowdowns (up to 367× for adversarial cases with n = 1000). Both the best agent (openhands:gpt-5) and the expert identified the core issue and implemented the same early-termination optimization. However, the agent also implemented additional micro-optimizations that further reduced overhead, resulting in a +0.0132 advantage over the human's solution.

Human Expert Patch

diff --git a/src/pybamm/solvers/processed_variable.py b/...
--- a/src/pybamm/solvers/processed_variable.py
+++ b/src/pybamm/solvers/processed_variable.py
@@ -443,16 +443,18 @@ class ProcessedVariable:
     dvar_dp_func = casadi.Function(
         "dvar_dp", [t_casadi, y_casadi, p_casadi_stacked], [dvar_dp]
     )
-    for idx, t in enumerate(ts):
-        u = ys[:, idx]
-        next_dvar_dy_eval = dvar_dy_func(t, u, inputs_stacked)
-        next_dvar_dp_eval = dvar_dp_func(t, u, inputs_stacked)
-        if idx == 0:
-            dvar_dy_eval = next_dvar_dy_eval
-            dvar_dp_eval = next_dvar_dp_eval
-        else:
-            dvar_dy_eval = casadi.diagcat(dvar_dy_eval, next_dvar_dy_eval)
-            dvar_dp_eval = casadi.vertcat(dvar_dp_eval, next_dvar_dp_eval)
+    dvar_dy_eval = casadi.diagcat(
+        *[
+            dvar_dy_func(t, ys[:, idx], inputs_stacked)
+            for idx, t in enumerate(ts)
+        ]
+    )
+    dvar_dp_eval = casadi.vertcat(
+        *[
+            dvar_dp_func(t, ys[:, idx], inputs_stacked)
+            for idx, t in enumerate(ts)
+        ]
+    )
     # Compute sensitivity
     S_var = dvar_dy_eval @ dy_dp + dvar_dp_eval

AI Agent Patch

diff --git a/src/pybamm/solvers/processed_variable.py b/...
--- a/src/pybamm/solvers/processed_variable.py
+++ b/src/pybamm/solvers/processed_variable.py
@@ -436,29 +439,30 @@ class ProcessedVariable:
     dvar_dy = casadi.jacobian(var_casadi, y_casadi)
     dvar_dp = casadi.jacobian(var_casadi, p_casadi_stacked)
-    dvar_dy_func = casadi.Function(
-        "dvar_dy", [t_casadi, y_casadi, p_casadi_stacked], [dvar_dy]
-    )
-    dvar_dp_func = casadi.Function(
-        "dvar_dp", [t_casadi, y_casadi, p_casadi_stacked], [dvar_dp]
    # Single function returning both jacobians
+    grads_func = casadi.Function(
+        "pv_grads", [t_casadi, y_casadi, p_casadi_stacked],
+        [dvar_dy, dvar_dp]
     )
-    for idx, t in enumerate(ts):
+
+    dvar_dy_blocks = []
+    dvar_dp_blocks = []
+    for idx in range(ts.size):
+        t = ts[idx]
         u = ys[:, idx]
-        next_dvar_dy_eval = dvar_dy_func(t, u, inputs_stacked)
-        next_dvar_dp_eval = dvar_dp_func(t, u, inputs_stacked)
-        if idx == 0:
-            dvar_dy_eval = next_dvar_dy_eval
-            dvar_dp_eval = next_dvar_dp_eval
-        else:
-            dvar_dy_eval = casadi.diagcat(dvar_dy_eval, next_dvar_dy_eval)
-            dvar_dp_eval = casadi.vertcat(dvar_dp_eval, next_dvar_dp_eval)
+        g_dy, g_dp = grads_func(t, u, inputs_stacked)
+        dvar_dy_blocks.append(g_dy)
+        dvar_dp_blocks.append(g_dp)
+    # Concatenation in one shot
+    dvar_dy_eval = casadi.diagcat(*dvar_dy_blocks)
+    dvar_dp_eval = casadi.vertcat(*dvar_dp_blocks)
     # Compute sensitivity
     S_var = dvar_dy_eval @ dy_dp + dvar_dp_eval

Summary

Replaced the incremental per-timestep casadi.diagcat/casadi.vertcat loop with list comprehensions that build all Jacobian blocks first, then concatenate once via unpacking (*blocks). Also added a CHANGELOG.md entry (not shown).

Summary

Consolidated the two separate casadi.Function objects (dvar_dy_func, dvar_dp_func) into a single grads_func returning both Jacobians, reducing per-timestep function-call overhead. Collects results in lists and concatenates once. Also adds guards for empty time series and empty result lists.
Figure 20: pybamm_team-pybamm_1: PyBaMM's ProcessedVariable sensitivity computation in IDAKLUSolver used an incremental per-timestep concatenation operation, creating quadratic memory allocation overhead. Both the best agent (openhands:gpt-5) and the expert identified that, instead of each loop iteration building a progressively larger matrix by concatenating onto the existing result, it would be more efficient to first collect all blocks and then concatenate once at the end. The agent added further micro-optimizations, consolidating two accumulation function calls into one and adding empty-input guards. This resulted in a +0.0167 agent advantage.

Human Expert Patch

diff --git a/shapely/decorators.py b/shapely/decorators.py
--- a/shapely/decorators.py
+++ b/shapely/decorators.py
-def deprecate_positional(should_be_kwargs, category=DeprecationWarning):
+def deprecate_positional(
+    should_be_kwargs: Iterable[str],
+    category: type[Warning] = DeprecationWarning,
+):
+    def decorator(func: Callable):
+        code = unwrap(func).__code__
+        pos_names = code.co_varnames[: code.co_argcount]
+        name_to_idx = {name: idx for idx, name in enumerate(pos_names)}
+        deprecate_positions = [
+            (name_to_idx[name], name) for name in should_be_kwargs
+            if name in name_to_idx]
+        if not deprecate_positions:
+            return func
+        warn_from = min(deprecate_positions)[0]
+
+        @lru_cache(10)
+        def make_msg(n_args: int):
+            used = [name for idx, name in deprecate_positions if idx < n_args]
            # ... (build warning message string)
+            return f"positional argument{plr} {args_txt} for '{func.__name__}' ..."
-    def decorator(func):
         @wraps(func)
         def wrapper(*args, **kwargs):
-            ret_value = func(*args, **kwargs)
-            sig = inspect.signature(func)
-            args_bind = sig.bind_partial(*args)
-            warn_args = [f"'{arg}'" for arg in args_bind.arguments.keys()
-                         if arg in should_be_kwargs]
-            if warn_args:
                # ...
                # (build warning message)
-                warnings.warn(msg, category=category, stacklevel=2)
-            return ret_value
+            result = func(*args, **kwargs)
+            n = len(args)
+            if n > warn_from:
+                warnings.warn(make_msg(n), category=category, stacklevel=2)
+            return result
         return wrapper

AI Agent Patch

diff --git a/shapely/decorators.py b/shapely/decorators.py
--- a/shapely/decorators.py
+++ b/shapely/decorators.py
     def decorator(func):
        # Cache signature inspection at decoration time
+        sig = inspect.signature(func)
+        param_names = list(sig.parameters.keys())
+        deprecated_positions = set()
+        for param_name in should_be_kwargs:
+            if param_name in param_names:
+                deprecated_positions.add(param_names.index(param_name))
+        if not deprecated_positions:
+            return func
+        max_deprecated_pos = max(deprecated_positions)
         @wraps(func)
         def wrapper(*args, **kwargs):
-            ret_value = func(*args, **kwargs)
-            sig = inspect.signature(func)
-            args_bind = sig.bind_partial(*args)
            # ... (per-call signature inspection)
            # Fast path: skip if not enough args
+            if len(args) <= max_deprecated_pos:
+                return func(*args, **kwargs)
            # Only check deprecated positions
+            warn_positions = [pos for pos in deprecated_positions if pos < len(args)]
+            if warn_positions:
+                args_bind = sig.bind_partial(*args)
                # ... (build and emit warning)
+            return func(*args, **kwargs)
         return wrapper

Summary

Completely rewrote the deprecate_positional decorator: replaced inspect.signature with inspect.unwrap and direct __code__ introspection at decoration time, added an lru_cache-backed make_msg helper to avoid rebuilding warning strings, and included type annotations and a comprehensive 138-line test suite.

Summary

Cached inspect.signature at decoration time and precomputed deprecated parameter positions as a set. Added an early-return fast path when no deprecated parameters exist and a second fast path that skips checking when the argument count is below the threshold.
Figure 21: shapely_shapely_1: The deprecate_positional decorator in Shapely called inspect.signature and sig.bind_partial on every decorated function invocation, causing a 300–1000% performance regression. Users reported significant Polygon creation slowdowns. The best agent (terminus-2:claude-sonnet-4) and the human expert converged on nearly identical core strategies. Both implemented a caching layer to move signature inspection from call time to decoration time. The agent added additional micro-optimizations to skip checks when no deprecated parameters exist or when the argument count is below the threshold. This resulted in a +0.0131 advantage over the human's solution.

Table 15: Repositories and tasks after applying rule-based filters (Filter Stage 1) and LLM-based filters (Filter Stage 2) as described in §A.1.2. We also show the number of tasks, the date of creation of the latest task, and additional information about the functionality and popularity of the repository. Most repositories are software tools used extensively within scientific communities.

Repository Name | #Stars | #Forks | Filter Stage 1 | Filter Stage 2 | Latest Task Date | Description
1. scikit-learn/scikit-learn | 63792 | 26359 | 2434 | 243 | 2025-10-31 | scikit-learn: machine learning in Python
2. pandas-dev/pandas | 46922 | 19184 | 3298 | 560 | 2025-11-11 | Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
3. scipy/scipy | 14120 | 5516 | 1454 | 209 | 2025-10-29 | SciPy library main repository
4. apache/arrow | 16089 | 3884 | 1988 | 267 | 2025-07-22 | Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
5. networkx/networkx | 16277 | 3415 | 288 | 44 | 2025-09-16 | NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
6.
Qiskit/qiskit | 6598 | 2659 | 717 | 212 | 2025-11-19 | Qiskit is an open-source SDK for working with quantum computers at the level of pulses, circuits, and application modules.
7. scikit-image/scikit-image | 6371 | 2320 | 458 | 54 | 2025-11-18 | Image processing in Python
8. pymc-devs/pymc | 9322 | 2146 | 685 | 45 | 2025-09-23 | PyMC (formerly PyMC3) is a Python package for Bayesian statistical modeling focusing on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms.
9. Textualize/rich | 54172 | 1920 | 165 | 11 | 2025-07-25 | Rich is a Python library for rich text and beautiful formatting in the terminal.
10. tqdm/tqdm | 30580 | 1402 | 12 | 1 | 2022-03-24 | Fast, extensible progress bar for Python and CLI
11. pydata/xarray | 4004 | 1192 | 609 | 101 | 2025-11-21 | N-D labeled arrays and datasets in Python
12. optuna/optuna | 12922 | 1177 | 719 | 112 | 2025-11-05 | A hyperparameter optimization framework
13. quantumlib/Cirq | 4772 | 1151 | 10 | 3 | 2025-11-18 | Python framework for creating, editing, and invoking Noisy Intermediate-Scale Quantum (NISQ) circuits.
14. pvlib/pvlib-python | 1424 | 1126 | 110 | 8 | 2025-10-03 | A set of documented functions for simulating the performance of photovoltaic energy systems.
15. ipython/ipyparallel | 2626 | 1006 | 65 | 6 | 2024-10-28 | IPython Parallel: Interactive Parallel Computing in Python
16. geopandas/geopandas | 4940 | 981 | 314 | 22 | 2025-05-22 | Python tools for geographic data
17. kedro-org/kedro | 10593 | 971 | 41 | 4 | 2025-07-17 | Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
18. HIPS/autograd | 7379 | 928 | 13 | 1 | 2017-10-21 | Efficiently computes derivatives of NumPy code.
19. MDAnalysis/mdanalysis | 1477 | 733 | 196 | 23 | 2025-10-13 | MDAnalysis is a Python library to analyze molecular dynamics simulations.
20.
pybamm-team/PyBaMM | 1387 | 692 | 218 | 17 | 2025-04-29 | PyBaMM (Python Battery Mathematical Modelling) is an open-source battery simulation package written in Python.
21. modin-project/modin | 10332 | 669 | 50 | 8 | 2025-09-30 | Speed up your Pandas workflows by changing a single line of code
22. nilearn/nilearn | 1322 | 631 | 138 | 2 | 2025-10-09 | Machine learning for NeuroImaging in Python
23. sunpy/sunpy | 971 | 626 | 663 | 22 | 2025-05-16 | sunpy is a Python software package that provides fundamental tools for accessing, loading and interacting with solar physics data in Python.
24. shapely/shapely | 4284 | 600 | 150 | 21 | 2025-05-03 | Manipulation and analysis of geometric objects
25. dedupeio/dedupe | 4387 | 568 | 25 | 4 | 2023-12-19 | A python library for accurate and scalable data deduplication and entity-resolution.
26. h5py/h5py | 2174 | 547 | 263 | 35 | 2025-08-10 | h5py is a thin, pythonic wrapper around HDF5
27. PyWavelets/pywt | 2294 | 517 | 12 | 1 | 2024-07-16 | PyWavelets - Wavelet Transforms in Python
28. pydicom/pydicom | 2070 | 508 | 86 | 7 | 2025-05-12 | Read, modify and write DICOM files with python code
29. arviz-devs/arviz | 1737 | 458 | 107 | 5 | 2025-10-21 | Exploratory analysis of Bayesian models
30. napari/napari | 2512 | 454 | 849 | 69 | 2025-09-30 | napari: a fast, interactive, multi-dimensional image viewer for python
31. tardis-sn/tardis | 225 | 446 | 268 | 13 | 2025-09-16 | TARDIS - Temperature And Radiative Diffusion In Supernovae
32. dipy/dipy | 787 | 446 | 194 | 16 | 2025-11-18 | DIPY is the paragon 3D/4D+ medical imaging library in Python. Contains generic methods for spatial normalization, signal processing, machine learning, statistical analysis and visualization of medical images. Additionally, it contains specialized methods for computational anatomy including diffusion, perfusion and structural imaging.
33.
python-control/python- control 1908 444 117 6 2025-06-21 The Python Control Systems Library is a Python module that implements basic operations for analysis and design of feedback control systems. 34 . SciT ools/cartopy 1545 389 74 6 2025-04-26 Cartopy is a Python package designed for geospatial data processing in order to produce maps and other geospatial data analyses. 35 . holoviz/datashader 3467 377 90 19 2025-10-09 Quickly and accurately render e ven the largest data. 36 . microsoft/Qcodes 396 335 187 10 2025-09-05 Modular data acquisition framew ork 37 . mars-project/mars 2748 326 164 51 2023-02-16 Mars is a tensor -based unified frame- work for lar ge-scale data computation which scales numpy , pandas, scikit- learn and Python functions. 38 . pytroll/satpy 1146 320 520 45 2025-08-02 Python package for reading, manipulat- ing and writing satellite data 39 . SciT ools/iris 692 297 109 23 2025-10-31 A po werful, format-agnostic, and community-driv en Python package for analysing and visualising Earth science data 40 . lmfit/lmfit-py 1164 290 205 8 2022-09-05 Non-Linear Least Squares Minimiza- tion, with flexible Parameter settings, based on scipy .optimize, and with many additional classes and methods for curve fitting. 41 . deepchecks/deepchecks 3924 286 99 9 2023-12-06 Deepchecks: T ests for Continuous V alidation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML valida- tion needs, enabling to thoroughly test your data and models from research to production. 42 . devitocodes/de vito 632 242 99 7 2025-07-24 DSL and compiler frame work for au- tomated finite-dif ferences and stencil computation 43 . danielgtaylor/p ython- betterproto 1733 233 42 1 2023-12-07 Better Protobuf / gRPC code generator and library for Python 44 . scikit-learn-contrib/metric- learn 1425 229 6 1 2017-11-27 Metric Learning in Python 45 . pydicom/pynetdicom 551 188 24 1 2025-05-24 A Python implementation of the DI- COM networking protocol 46 . 
scverse/anndata 667 175 142 17 2025-07-23 Annotated data matrix for single-cell genomics 47 . apache/arrow-adbc 498 160 571 63 2025-11-07 Database connecti vity API standard and libraries for Apache Arrow 48 . man-group/ArcticDB 2102 153 11 2 2025-11-19 ArcticDB is a high performance data store for time series and tick data 49 . stac-utils/pystac 412 127 48 1 2023-03-31 Python library for working with Spa- tioT emporal Asset Catalog (ST A C) Continued on next pa ge 42 Repository Name #Stars #Forks Filter Stage 1 Filter Stage 2 Latest T ask Date Description 50 . xdslproject/xdsl 433 125 2136 236 2025-11-04 A Python compiler design toolkit. 51 . ActivitySim/acti vitysim 217 117 51 10 2025-11-12 An open platform for acti vity-based trav el behavior modeling 52 . OGGM/oggm 245 115 484 36 2025-04-01 Open Global Glacier Model (OGGM): a modular frame work for glacier model- ing 53 . datalad/datalad 613 115 426 31 2024-09-10 Keep code, data, containers under con- trol with git and git-annex 54 . pydata/bottleneck 1144 112 61 20 2025-04-29 Fast NumPy array functions written in C 55 . wmayner/pyphi 406 100 25 1 2024-09-24 A toolbox for inte grated information theory . 56 . django-components/ django-components 1463 100 53 3 2025-09-30 Reusable, composable components for Django templates 57 . sourmash-bio/sourmash 524 88 297 27 2025-01-09 Quickly search, compare, and analyze genomic and metagenomic data sets. 58 . tskit-dev/msprime 201 88 209 9 2025-07-24 Simulate genealogical trees and ge- nomic sequence data using population genetic models 59 . numpy/numpy-financial 384 87 13 4 2024-04-04 Financial functions for NumPy 60 . makepath/xarray-spatial 894 85 38 9 2023-02-16 Spatial analysis algorithms for xarray implemented in numba 61 . dwa vesystems/dimod 135 84 152 20 2024-06-13 dimod is a shared API for samplers. 62 . python-hyper/h11 530 83 18 2 2025-01-12 A pure-Python, bring-your -own-I/O implementation of HTTP/1.1 63 . 
bjodah/chempy 611 81 69 1 2018-03-24 A package useful for chemistry written in Python 64 . holoviz/param 497 79 85 10 2025-02-27 Declarativ e parameters for robust Python classes and a rich API for re- activ e programming 65 . inducer/loopy 615 78 172 15 2023-07-27 A code generator for array computations on CPUs and GPUs 66 . holgern/beem 138 75 75 5 2020-12-22 A Python library for Hiv e and Steem 67 . scverse/spatialdata 329 75 20 2 2025-09-29 An open and interoperable data frame- work for spatial omics data 68 . pysb/pysb 188 71 107 7 2021-01-20 PySB is a frame work for building math- ematical models of biochemical systems as Python programs 69 . xorbitsai/xorbits 1199 70 186 22 2024-11-16 Xorbits is an open-source computing framew ork that makes it easy to scale data science and machine learning work- loads — from data preprocessing to tuning, training, and model serving. 70 . pysal/momepy 563 67 80 12 2024-07-16 Urban Morphology Measuring T oolkit 71 . python-adapti ve/adapti ve 1203 62 28 5 2025-08-21 :chart_with_upwards_trend: Adaptiv e: parallel acti ve learning of mathematical functions Continued on next pa ge 43 Repository Name #Stars #Forks Filter Stage 1 Filter Stage 2 Latest T ask Date Description 72 . probabilistic-numerics/ probnum 459 61 52 7 2023-05-04 Probabilistic numerics in Python 73 . neurostuff/NiMARE 197 60 14 1 2025-06-13 Coordinate- and image-based meta- analysis in Python 74 . NCAR/geocat-comp 140 56 18 2 2025-08-18 GeoCA T -comp provides implementa- tions of computational functions for operating on geosciences data. Many of these functions originated in NCL and were translated into Python. 75 . mie-lab/trackintel 243 53 55 5 2024-01-07 trackintel is a library for the analysis of spatio-temporal tracking data with a focus on human mobility . 76 . JD ASoftwareGroup/ kartothek 160 53 152 31 2021-03-17 A dataset library for partitioned datasets stored in Parquet 77 . 
AllenCellModeling/ aicsimageio 220 51 50 3 2023-04-05 Image Reading, Metadata Con version, and Image Writing for Microscop y Images in Python 78 . dottxt-ai/outlines-core 254 50 44 5 2025-03-31 Core library for Outlines, pro viding structured text generation utilities 79 . apache/arrow-nanoarro w 207 47 109 8 2025-10-27 nanoarrow: a (C) library for the Apache Arrow C Data interf ace 80 . pangeo-data/climpred 252 47 9 2 2021-11-20 :earth_americas: V erification of weather and climate forecasts :earth_africa: 81 . pybop-team/PyBOP 152 45 78 8 2025-07-15 A parameterisation and optimisation package for battery models. 82 . UXARRA Y/uxarray 202 44 99 22 2025-09-11 Python library for working with unstruc- tured grid model data in xarray 83 . pygeos/pygeos 388 43 101 17 2021-11-30 Wraps GEOS geometry functions in numpy ufuncs 84 . innobi/pantab 120 41 79 7 2024-10-31 Read/Write pandas DataFrames with T ableau Hyper Extracts 85 . xarray-contrib/xskillscore 237 41 23 1 2021-11-20 Metrics for verifying forecasts 86 . glotzerlab/signac 135 37 17 2 2025-04-04 Manage large and heterogeneous data spaces on the file system. 87 . sgkit-dev/sgkit 265 37 113 21 2025-09-30 Scalable genetics toolkit 88 . T ileDB-Inc/T ileDB-Py 198 36 51 5 2025-08-01 Python API for T ileDB 89 . IntelPython/dpctl 117 31 37 2 2025-10-02 Data Parallel Control (dpctl) - Python device control and USM memory for SYCL 90 . tensorwerk/hangar-p y 205 29 19 1 2019-12-04 Hangar is version control for tensor data. Commit, branch, mer ge, rev ert, and collaborate in the data-defined softw are era. 91 . xarray-contrib/xbatcher 184 28 20 3 2023-07-31 Batch generation from xarray objects. 92 . D ASDAE/dascore 121 26 122 11 2025-09-20 D ASCore: A Python package for the analysis of distrib uted acoustic sensing data. 93 . IntelPython/dpnp 116 23 680 26 2025-10-14 Data Parallel Extension for NumPy Continued on next pa ge 44 Repository Name #Stars #Forks Filter Stage 1 Filter Stage 2 Latest T ask Date Description 94 . 
not522/ac-library-python 230 23 5 2 2021-11-19 Python implementation of AtCoder Library 95 . xarray-contrib/flox 133 21 150 39 2025-07-17 Fast groupby reductions for dask and xarray 96 . scipp/scipp 136 21 268 26 2025-03-17 Python library for multi-dimensional data analysis 97 . pyapp-kit/psygnal 115 21 70 10 2025-09-24 Python observer pattern (callback/ev ent system). Modeled after Qt Signals & Slots (but independent of Qt) 98 . royerlab/ultrack 149 21 68 5 2025-09-23 Cell tracking and segmentation software 99 . xitorch/xitorch 155 21 9 2 2024-05-24 Differentiable scientific computing for PyT orch 100 . Quansight-Labs/ndindex 107 16 12 3 2025-05-14 A Python library for manipulating N- dimensional array indices 101 . jkjkil4/J Anim 189 14 3 1 2025-03-28 Programmatic animation engine for creating precise and smooth animations with real-time feedback 45 T able 16: Repositories and T asks represented in F O R M U L A C O D E (as of November 30, 2025). W e sho wcase a repository lev el breakdown of the number of tasks, the latest task (by PR merge date), the average dif ficulty (0-5, with 0 being easiest), the av erage number of tokens in the human patch and in the prompt instructions, and the most common optimization type of the human patch. Repository #T asks Latest T ask A vg. Diffi- culty A vg. Patch Size (T o- kens) A vg. PR Size (T okens) Most Common Optimization 1 . pandas-dev/pandas 222 2025-10-21 0.77 1842.85 489.35 Micro Optimizations (26.6%) 2 . scikit-learn/scikit-learn 143 2025-10-31 1.0 2735.29 491.49 Micro Optimizations (23.1%) 3 . Qiskit/qiskit 142 2025-10-03 1.73 4438.38 505.02 Use Lo wer Lev el System (28.2%) 4 . xdslproject/xdsl 134 2025-10-09 1.36 3567.76 463.46 Remov e Or Reduce W ork (37.3%) 5 . optuna/optuna 94 2025-11-05 0.96 546.29 471.81 Use Better Algorithm (24.5%) 6 . pydata/xarray 69 2025-11-21 0.98 1929.9 474.04 Micro Optimizations (30.4%) 7 . scikit-image/scikit- image 39 2024-11-20 0.83 2271.46 481.36 Remov e Or Reduce W ork (28.2%) 8 . 
networkx/networkx 35 2025-09-16 1.0 1809.74 480.46 Use Better Algorithm (42.9%) 9 . pytroll/satpy 30 2024-11-20 1.42 777.4 483.7 Use Better Data Structure And Layout (30.0%) 10 . pymc-de vs/pymc 18 2025-06-16 1.81 2589.89 479.89 Use Better Algorithm (33.3%) 11 . xarray-contrib/flox 17 2025-07-17 1.47 2149.24 485.18 Use Better Algorithm (29.4%) 12 . dwa vesystems/dimod 15 2024-06-13 1.33 2322.93 476.4 Use Better Algorithm (26.7%) 13 . geopandas/geopandas 13 2025-05-22 0.77 2231.62 497.15 Use Better Algorithm (46.2%) 14 . UXARRA Y/uxarray 13 2025-09-11 1.73 4722.15 489.38 Remov e Or Reduce W ork (23.1%) 15 . pydata/bottleneck 13 2020-11-25 1.54 1293.23 492.0 Use Lo wer Lev el System (38.5%) 16 . sgkit-dev/sgkit 12 2025-09-30 1.25 2231.67 469.0 Do It Earlier Batch Throttle (25.0%) 17 . sourmash-bio/sourmash 11 2022-07-20 1.36 2561.45 491.91 Use Better Algorithm (27.3%) 18 . JDASoftwareGroup/ kartothek 10 2020-10-01 0.5 1026.8 466.5 Micro Optimizations (40.0%) 19 . datalad/datalad 10 2021-03-19 0.25 597.5 492.8 Remov e Or Reduce W ork (40.0%) 20 . mars-project/mars 10 2023-02-16 1.75 3936.5 495.1 Micro Optimizations (30.0%) 21 . pysal/momepy 9 2024-07-16 1.39 3021.56 469.33 Use Better Algorithm (77.8%) 22 . T extualize/rich 9 2025-07-25 0.56 391.11 471.67 Micro Optimizations (55.6%) 23 . tskit-dev/msprime 7 2025-07-24 1.43 3013.43 468.86 Micro Optimizations (28.6%) 24 . pygeos/pygeos 7 2021-11-30 2.14 5001.57 483.43 Use Lo wer Lev el System (42.9%) 25 . microsoft/Qcodes 7 2025-08-27 0.71 800.43 467.71 Do It Earlier Batch Throttle (28.6%) 26 . napari/napari 7 2025-07-29 1.79 2595.86 485.71 Cache And Reuse (28.6%) 27 . shapely/shapely 6 2025-05-03 0.83 2131.5 480.17 Use Better Algorithm (33.3%) Continued on next pa ge 46 Repository #T asks Latest T ask A vg. Diffi- culty A vg. Patch Size (T o- kens) A vg. PR Size (T okens) Most Common Optimization 28 . pyapp-kit/psygnal 6 2025-09-24 0.83 1647.33 482.83 Remov e Or Reduce W ork (50.0%) 29 . 
ActivitySim/acti vitysim 6 2024-08-09 1.25 833.83 465.17 Remov e Or Reduce W ork (33.3%) 30 . pvlib/pvlib-python 5 2025-10-03 1.5 7490.2 482.6 Use Better Algorithm (40.0%) 31 . pybamm-team/ PyBaMM 5 2025-04-29 1.5 1637.6 496.8 Cache And Reuse (20.0%) 32 . D ASDAE/dascore 5 2025-09-20 1.5 5505.6 469.2 Cache And Reuse (40.0%) 33 . deepchecks/deepchecks 5 2023-12-06 1.5 3384.6 505.0 Use Better Algorithm (60.0%) 34 . modin-project/modin 5 2025-09-30 2.0 5533.0 481.0 Micro Optimizations (60.0%) 35 . mie-lab/trackintel 4 2024-01-07 0.62 1404.75 471.75 Use Better Algorithm (50.0%) 36 . lmfit/lmfit-py 4 2022-09-05 0.0 411.75 497.0 Do It Earlier Batch Throttle (25.0%) 37 . dottxt-ai/outlines-core 4 2025-03-31 0.62 5003.75 480.75 Remov e Or Reduce W ork (25.0%) 38 . pybop-team/PyBOP 4 2025-07-15 1.88 3863.0 464.5 Uncategorized (75.0%) 39 . sunpy/sunpy 4 2025-05-12 1.25 1852.25 486.25 Cache And Reuse (50.0%) 40 . SciT ools/cartopy 4 2025-04-26 1.88 1000.0 475.75 Cache And Reuse (50.0%) 41 . holgern/beem 4 2018-11-30 0.62 1302.5 462.0 Use Better Algorithm (50.0%) 42 . dipy/dipy 3 2025-03-12 0.83 803.67 523.67 Micro Optimizations (33.3%) 43 . kedro-or g/kedro 3 2025-07-17 0.83 1764.67 526.33 Cache And Reuse (66.7%) 44 . python-adapti ve/ adaptiv e 3 2025-08-21 0.0 1400.0 462.33 Cache And Reuse (33.3%) 45 . devitocodes/de vito 3 2025-07-22 2.5 2156.67 484.33 Cache And Reuse (66.7%) 46 . T ileDB-Inc/T ileDB-Py 3 2025-07-29 0.83 1823.33 482.0 Remov e Or Reduce W ork (33.3%) 47 . numpy/numpy-financial 2 2024-04-04 1.25 423.0 457.5 Use Lo wer Lev el System (100.0%) 48 . xarray-contrib/xbatcher 2 2023-01-03 2.5 2981.0 502.5 Do It Earlier Batch Throttle (50.0%) 49 . django-components/ django-components 2 2025-09-30 0.0 6528.0 463.0 Cache And Reuse (50.0%) 50 . glotzerlab/signac 2 2025-04-04 1.25 3955.0 532.5 Cache And Reuse (50.0%) 51 . dedupeio/dedupe 2 2023-02-17 2.5 709.0 503.0 Micro Optimizations (50.0%) 52 . 
NCAR/geocat-comp 2 2025-08-18 2.5 2615.0 498.5 Remov e Or Reduce W ork (50.0%) 53 . innobi/pantab 2 2024-01-22 0.0 650.5 446.5 Use Better Data Structure And Layout (50.0%) 54 . h5py/h5py 2 2025-05-23 2.5 550.5 548.5 Remov e Or Reduce W ork (50.0%) 55 . nilearn/nilearn 2 2025-10-09 0.0 4810.0 486.5 Micro Optimizations (50.0%) 56 . holoviz/param 2 2025-02-27 0.0 1287.0 473.5 Do It Earlier Batch Throttle (50.0%) Continued on next pa ge 47 Repository #T asks Latest T ask A vg. Diffi- culty A vg. Patch Size (T o- kens) A vg. PR Size (T okens) Most Common Optimization 57 . AllenCellModeling/ aicsimageio 1 2022-04-13 2.5 6813.0 505.0 Use Higher Le vel System (100.0%) 58 . HIPS/autograd 1 2017-10-21 0.0 525.0 463.0 Micro Optimizations (100.0%) 59 . OGGM/oggm 1 2022-09-07 0.0 511.0 442.0 Micro Optimizations (100.0%) 60 . arviz-devs/arviz 1 2024-05-10 0.0 299.0 458.0 Micro Optimizations (100.0%) 61 . danielgtaylor/python- betterproto 1 2023-12-07 0.0 2995.0 507.0 Use Lo wer Lev el System (100.0%) 62 . makepath/xarray- spatial 1 2022-05-12 2.5 3774.0 436.0 Use Lo wer Lev el System (100.0%) 63 . Quansight-Labs/ ndindex 1 2024-09-20 2.5 375.0 476.0 Use Lo wer Lev el System (100.0%) 64 . not522/ac-library- python 1 2021-11-19 0.0 388.0 441.0 Micro Optimizations (100.0%) 65 . royerlab/ultrack 1 2025-04-22 2.5 1816.0 437.0 Do It Earlier Batch Throttle (100.0%) 66 . stac-utils/pystac 1 2023-03-31 0.0 1593.0 461.0 Micro Optimizations (100.0%) 67 . tqdm/tqdm 1 2022-03-24 0.0 372.0 448.0 Micro Optimizations (100.0%) 68 . wmayner/pyphi 1 2024-09-24 2.5 1057.0 480.0 Remov e Or Reduce W ork (100.0%) 69 . 
xitorch/xitorch 1 2024-05-24 0.0 4352.0 479.0 Micro Optimizations (100.0%) 48 ···· ···· ···· Class Module Complete Workload Function Pandas (pd) pd.algorithms.* Stratified Speedup 1.01 Quantile.* Stratified Speedup 1.28 Hashing.* Stratified Speedup 0.95 time_quantile.* Stratified Speedup 1.64 time_dates.* Stratified Speedup 0.95 time_timedeltas.* Stratified Speedup 0.95 time_quantile(‘float’) Stratified Speedup 3.04 time_quantile(‘int’) Stratified Speedup 0.89 pd.algorithms.Quantile.* Figure 22: Illustration of Hierarchical Grouping of Pandas W orkloads. By construction, each workload in F O R M U - L A C O D E is organized hierarchically based on three lev els: ℓ = 1 (Module), ℓ = 2 (Class), and ℓ = 3 (Function). Metrics (like speedup agent and Adv agent ) are computed for each complete workload (leaf nodes). W e can semantically aggregate workloads by stratification of workloads based on this hierarcy . For instance, in this example, the stratified speedup of pd.algorithms.Quantile.* can be calculated by computing the geometric mean of all leaf nodes that share the same the prefix string (depicted in the gray dotted box; pd.algorithms.Quantile.time_quantile(‘float’) , pd.algorithms.Quantile.time_quantile(‘int’) , and other complete workloads not sho wn.). The example also illus- trates how highly localized optimizations are diluted by stratification, and underscores that, at higher levels of stratification, consistent speedups across a large number of workloads is required to achie ve a significant stratified speedup. 49 Expert Speedup Agent Speedup 1.0 1.0 Equal Advantage Super Optimization Under Optimization Performance Degradation Regression Figure 23: V isual intuition for Agent Advantage ( Adv agent ; § 2 ). Each cross ( ✗ ) represents an individual w orkload using the expert-deri ved speedup ( speedup expert ) and the agent-deriv ed speedup ( speedup agent ). The identity function line represents equal advantag e (i.e., speedup expert = speedup agent ). 
Then, the agent advantage is the mean weighted deviation from the equal-advantage line. The plot also showcases four optimization regions, clockwise from top: (1) Super Optimization: workloads where the agent's code performs better than both the expert's code and the baseline. (2) Under Optimization: workloads where the agent's code and the expert's code both deliver a positive speedup, but the expert outperforms the agent. (3) Performance Degradation: workloads where the expert discovers a speedup while the agent slows down the code. (4) Regression: workloads where both the expert and the agent slow down the code; usually an intentional tradeoff to optimize other workloads. Figure 24 showcases an example workload distribution for various agents on FORMULACODE.

[Figure 24 heatmaps: binned counts of workloads by expert speedup vs. agent speedup (bins ≤0 to >2) for four Terminus 2 agents: Claude 4.0 Sonnet (Advantage: -0.0410), Qwen 3 Coder (Advantage: -0.0454), Gemini 2.5 Pro (Advantage: -0.0433), GPT-5 (Advantage: -0.0504).]

Figure 24: Visualization of advantage for Terminus 2 agents. Refer to Figure 23 for an explanation of each region. Each square represents the number of workloads in that region (within 0.5 units). A speedup of 1.0 indicates no deviation from baseline performance. The red dotted line represents equal advantage. This visualization is helpful to gauge the holistic behavior of models across the entire workload distribution. For instance, Claude 4.0 Sonnet (top left) achieves a better overall advantage than GPT-5 (bottom right) by making measured and surgical optimizations that align with the equal-advantage line, whereas optimizations proposed by GPT-5 are more volatile, with more workloads experiencing performance degradations, effectively bringing the overall advantage down.

OBJECTIVE
You are a performance optimization expert. Speed up the repository while maintaining correctness.

TOOLING
The micromamba environment includes Pytest for correctness testing and Airspeed Velocity (ASV) for benchmarking measurements and profiling.

PROCESS
1. Scan & Baseline
   Read the code and any hints. Map likely bottlenecks. Establish a baseline by running the relevant ASV benchmarks.
2. Benchmark (ASV)
   Read through relevant benchmarks. Prefer targeted runs using '--bench='; full-suite runs are discouraged.
   Command:
       asv run --python=same --bench=""
   Find benchmarks via asv_benchmarks.txt or within the ASV benchmarks directory. You may run multiple benchmarks at once using regexes.
3. Profile Hotspots
   Profile relevant benchmarks to locate hot paths.
   Use ASV's built-in profiling support.
   Command:
       asv profile --python=same --config=
4. Optimize
   Make targeted changes that address the hot paths while maintaining correctness. Follow the Operating Principles below.

OPERATING PRINCIPLES
- One change/command at a time (code edit, ASV run, profiling).
- Baseline first, then iterate.
- Target the hot paths shown by profiling.
- Evidence-driven: justify changes with benchmark/profile data.
- Correctness first: never trade correctness for speed.

REPOSITORY DESCRIPTION
This repository is called Qiskit/qiskit. Qiskit/qiskit is written primarily in Python and is described as "Qiskit is an open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives."

TASK DESCRIPTION
Your main goal is to optimize the code to run as fast as possible. Use the following information if needed to understand the problem:

INITIAL OBSERVATIONS
Binding parameters with `ParameterExpression.bind` is slow, allocating many Python objects and taking tens of milliseconds per call when binding large dictionaries (e.g., 100k parameters).

RELEVANT ISSUES

Issue #14471: Addressing performance bottlenecks in ParameterExpression.bind
Environment: Qiskit version: 2.0.0
Summary: Let us consider a parameter expression 'expr' and a dictionary 'parameter_values: dict[Parameter, float]' with 'M' key, value pairs. Consider the following code to bind the expression:
    expression.bind(parameter_values)
As it turns out, this line takes time that grows with len(M). As far as I can tell, this is because qiskit applies some checks to all of the parameters in parameter_values. Even if it turns out that expression only needs one of them, all the parameters are checked and then only one of them is used.
Why this needs fixing: Sometimes, it is useful to maintain a log of parameters outside of a circuit (e.g., in a parameter table) and bind these parameters when needed against a 'parameter_values' dict. In this case, the 'QuantumCircuit.assign_parameters' method (which does some tricks to speed things up) is not available, and users take a hit in performance when they bind.
Some suggestions on how to fix this: Provide an option for users so that they can choose to check only the 'relevant' parameter values (i.e., those present in expression), so that the runtime of bind becomes independent of len(M). Review the checks and remove those that are not needed.
How can we reproduce the issue?
    from qiskit.circuit import Parameter
    N: int = ...
    parameter_values = {Parameter(f"th_{i}"): 1 for i in range(N)}
    parameter_values[param := Parameter("my_param")] = 1
    %timeit param.bind(parameter_values, allow_unknown_parameters=True)
On my laptop, with N=1 bind takes ~2.5 µs, but with N=10**5 it takes 17.8 ms.
Comments
I'd generally be supportive of removing huge tracts of the error-checking code from all the ParameterExpression methods.
Fwiw, there are a couple of tricks we ought to figure out: the ParameterExpression.bind method either has to be linear in the number of unbound parameters in the expression, or in the number of elements in the binding dictionary. ...
... be cheaper even than adding fast-paths through `ParameterExpression.bind`: we don't need to maintain the QPY replay log and we don't need to allocate a new `ParameterExpression` (which is quite heavy)

Figure 25: Example task in FORMULACODE for Qiskit/qiskit (PR: https://github.com/Qiskit/qiskit/pull/14782). The prompt presents a complete optimization task, including the performance goal, the benchmarking and profiling tools (Pytest and ASV), a structured optimization workflow, and concrete repository context with motivating performance observations. The "Relevant Issues" section contains GitHub issues that are directly related to the performance problem addressed by the PR (describing the underlying bottlenecks the PR aims to fix). These issues provide important background context that mimics a real, human-authored PR setting. Issue discussions are truncated only in this figure for brevity, while the full issue content is provided to the agent during execution.

OBJECTIVE
You are a performance optimization expert. Speed up the repository while maintaining correctness.

TOOLING
The micromamba environment includes Pytest for correctness testing and Airspeed Velocity (ASV) for benchmarking measurements and profiling.

PROCESS
1. Scan & Baseline
   Read the code and any hints. Map likely bottlenecks. Establish a baseline by running the relevant ASV benchmarks.
2. Benchmark (ASV)
   Read through relevant benchmarks. Prefer targeted runs using '--bench='; full-suite runs are too time-consuming and are discouraged.
   Command:
       # Always pin to current interpreter
       asv run --python=same --bench=""
   Find benchmarks via asv_benchmarks.txt or in the directory containing the ASV benchmarks. You may run multiple benchmarks at once using regexes.
3. Profile Hotspots
   Profile relevant benchmarks to locate hot paths. Use ASV's built-in profiling support.
   Command:
       asv profile --python=same --config=
4. Optimize
   Make targeted changes that address the hot paths while maintaining correctness. Always follow the Operating Principles below.

OPERATING PRINCIPLES
- One change/command at a time (code edit, ASV run, profiling).
- Baseline first, then iterate.
- Target the hot paths shown by profiling.
- Evidence-driven: justify changes with benchmark/profile data.
- Correctness first: never trade correctness for speed.

REPOSITORY DESCRIPTION
This repository is called shapely/shapely.
shapely/shapely is written primarily in Python and is described as "Manipulation and analysis of geometric objects".

TASK DESCRIPTION
Your main goal is to optimize the code to run as fast as possible. Use the following information if needed to understand the problem:

INITIAL OBSERVATIONS
The deprecate_positional decorator incurred a noticeable runtime penalty because it invoked the full inspect.signature machinery on every call, leading to slow polygon construction (e.g., ~107 ms per 1000 iterations in the main branch). Users also experienced repeated deprecation-warning processing overhead.

RELEVANT ISSUES

Issue #2280: 2.1 Polygon creation is much slower than 2.0.7
Summary: It seems that creating Polygons in 2.1 is much slower (roughly 5-10x) than in 2.0.7. The following script takes roughly 0.1 seconds with Shapely 2.1 and 0.015 with Shapely 2.0.7 on Python 3.12.
    import time
    import shapely

    if __name__ == "__main__":
        start_time = time.time()
        for _ in range(1000):
            coords = ((0., 0.), (0., 1.), (1., 1.), (1., 0.), (0., 0.))
            polygon = shapely.Polygon(coords)
        print(time.time() - start_time)
Comments: Thanks for the report. This slowdown seems to be due to the overhead of the decorator we added to deprecate positional arguments. That decorator does inspect the signature, which in ...
... I noticed an even greater performance degradation when running under a debugger.

Issue #2282: deprecate_positional is a performance bottleneck (300%-1000% slowdown) in Shapely 2.1
Summary: Performance analysis indicates that only 17 seconds from 66 seconds total is the implementation of transform. The remaining time is taken by the deprecate_positional decorator.
I have the following code:
    @overload
    def compressible_geometry(geometry: _GeomT, /) -> _GeomT: ...
    @overload
    def compressible_geometry(geometry: NDArray[np.float64], /) -> NDArray[np.float64]: ...
    ...
Comments: -

Figure 26: Example task in FORMULACODE for shapely/shapely (PR: https://github.com/shapely/shapely/pull/2283).

OBJECTIVE
You are a performance optimization expert. Speed up the repository while maintaining correctness.

TOOLING
The micromamba environment includes Pytest for correctness testing and Airspeed Velocity (ASV) for benchmarking measurements and profiling.

PROCESS
1. Scan & Baseline
   Read the code and any hints. Map likely bottlenecks. Establish a baseline by running the relevant ASV benchmarks.
2. Benchmark (ASV)
   Read through relevant benchmarks. Prefer targeted runs using '--bench='; full-suite runs are too time-consuming and are discouraged.
   Command:
       # Always pin to current interpreter
       asv run --python=same --bench=""
   Find benchmarks via asv_benchmarks.txt or in the directory containing the ASV benchmarks. You may run multiple benchmarks at once using regexes.
3. Profile Hotspots
   Profile relevant benchmarks to locate hot paths. Use ASV's built-in profiling support.
   Command:
       asv profile --python=same --config=
4. Optimize
   Make targeted changes that address the hot paths while maintaining correctness. Always follow the Operating Principles below.

OPERATING PRINCIPLES
- One change/command at a time (code edit, ASV run, profiling).
- Baseline first, then iterate.
- Target the hot paths shown by profiling.
- Evidence-driven: justify changes with benchmark/profile data.
- Correctness first: never trade correctness for speed.

REPOSITORY DESCRIPTION
This repository is called pandas-dev/pandas. pandas-dev/pandas is written primarily in Python and is described as "Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more".

TASK DESCRIPTION
Your main goal is to optimize the code to run as fast as possible. Use the following information if needed to understand the problem:

INITIAL OBSERVATIONS
The DataFrame.to_csv() call with index=False on a Multi-Index DataFrame was extremely slow (≈869 seconds for 10M rows × 20 cols), while resetting the index first and then calling to_csv() took only ≈42 seconds. The performance gap was observed consistently in the benchmark.

RELEVANT ISSUES

Issue #59312: PERF: Significant Performance Difference in DataFrame.to_csv() with and without Index Reset
Description:
Pandas version checks: I have checked that this issue has not already been reported. I have confirmed this issue exists on the latest version of pandas. I have not confirmed this issue exists on the main branch of pandas.
Reproducible Example
Below is a toy DataFrame example with 10M rows and 20 columns. The CSV write speeds differ significantly depending on whether the multi-index is dropped first or not, even if the resulting CSV files are essentially the same. The benchmark for PyArrow is also attached for reference. Notice that the CSV generated from PyArrow has column names and column values additionally double-quoted.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.csv as csv
    import time

    NUM_ROWS = 10000000
    NUM_COLS = 20
    df = pd.DataFrame({f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS) for col_idx in range(NUM_COLS)})
    ...
Comments
Thanks for the report! It seems to me the issue is here:
    https://github.com/pandas-dev/pandas/blob/642d2446060afb11f9860c79a7339eb6ec96fea7/pandas/io/formats/csvs.py#L323
A significant amount of time on that line is spent getting the index values, only to be ignored because self.nlevels is 0 when index=False. In addition, it seems to me that there may ...

Figure 27: Example task in FORMULACODE for pandas-dev/pandas (PR: https://github.com/pandas-dev/pandas/pull/59608).
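The hierarchical stratification described in Figure 22 reduces, at each level, to a geometric mean over the leaf workloads that share a name prefix. A minimal sketch of that aggregation (the function and variable names here are illustrative, not taken from the benchmark's codebase; the leaf speedups are the two shown in Figure 22):

```python
import math

def stratified_speedup(leaf_speedups: dict[str, float], prefix: str) -> float:
    """Geometric mean of leaf-workload speedups whose fully qualified
    name (module.Class.function) starts with `prefix`."""
    vals = [s for name, s in leaf_speedups.items() if name.startswith(prefix)]
    if not vals:
        raise ValueError(f"no workloads match prefix {prefix!r}")
    # Geometric mean computed in log space for numerical stability.
    return math.exp(sum(math.log(s) for s in vals) / len(vals))

# Leaf speedups from the Figure 22 example.
leaves = {
    "pd.algorithms.Quantile.time_quantile('float')": 3.04,
    "pd.algorithms.Quantile.time_quantile('int')": 0.89,
}
# Aggregating one level up dilutes the localized 3.04x win:
print(round(stratified_speedup(leaves, "pd.algorithms.Quantile.time_quantile"), 2))  # → 1.64
```

The geometric mean rewards consistency: the single 3.04x speedup is pulled down to 1.64x by the regressed 'int' variant, matching the stratified value for time_quantile.* shown in Figure 22, and at higher levels (Quantile.*, pd.algorithms.*) the same optimization is diluted further.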
