LLMs can produce functionally correct programs, yet correctness alone does not guarantee reliability. Two programs passing the same tests can exhibit drastically different runtime behavior, creating hidden risks such as performance bottlenecks and memory leaks. Despite this, the runtime consistency of LLM-generated code remains largely unexplored. In this work, we introduce a framework to systematically quantify execution-time memory stability across multiple correct generations for the same task. We propose a novel solution-level metric, DMPD (Dynamic Mean Pairwise Distance), which uses Dynamic Time Warping to compare the shapes of memory usage profiles. These profiles, which we term Monotonic Peak Profiles (MPPs), are transformed to suppress transient noise, enabling robust comparison. By aggregating these scores, we derive a model-level Model Instability Score (MIS). Across the BigOBench and CodeContests benchmarks, we find substantial runtime divergence among correct solutions, revealing that instability often increases with higher sampling temperatures even as pass@1 improves. We also uncover exploratory correlations between our stability metrics and established software-engineering indicators (e.g., Cognitive and Cyclomatic Complexity), suggesting a link between operational behavior and code maintainability. These findings enable stability-aware selection of passing candidates in CI/CD pipelines, reducing operational risk without sacrificing correctness. Artifacts are available at https://github.com/pkrajput/memory_profiling.

★ Prateek's research is in collaboration with Zortify.
† Yewei's research is in collaboration with BGL BNP Paribas.
‡ Aziz's research is in collaboration with B Medical Systems.
Large language models (LLMs) now routinely synthesize correct programs, with evaluation dominated by execution-based metrics such as pass@k on standardized suites [8,22,42]. However, correctness alone does not characterize how a model behaves at runtime once a solution has passed tests. In production, two equally correct solutions that exhibit different memory-allocation dynamics can have materially different cost and reliability profiles. Cloud platforms often charge in proportion to configured or consumed memory, and containerized deployments routinely fail due to out-of-memory (OOM) events when limits are exceeded. Beyond functional correctness, it is therefore important to measure runtime consistency across a model's multiple valid generations for the same problem.

Application-level memory traces are inherently temporal and noisy. In Python, reference counting and a cyclic garbage collector trigger non-deterministic reclamation and brief oscillations even for identical logic [1,31,38]. System-level indicators (e.g., RSS) further blur the signal by mixing allocator policy, fragmentation, and non-Python allocations [26]. To isolate contestant code while suppressing allocator churn, we instrument with tracemalloc and convert line-sampled current bytes into a Monotonic Peak Profile (MPP), the cumulative maximum of the baseline-corrected series (Figure 1). This transform removes downward spikes from transient frees and GC cycles, yielding a non-decreasing envelope that highlights reproducible peak-growth events across runs and environments. We then compare unit-peak MPPs by shape, using time-elastic Dynamic Time Warping (DTW) rather than raw magnitudes.
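As a minimal illustration of this transform (not the paper's exact instrumentation), the sketch below samples tracemalloc's current-bytes counter on line events and folds the trace into a unit-peak MPP. The function names, the sys.settrace-based sampling, and the choice of the pre-run counter as baseline are assumptions for the sake of the example.

```python
import sys
import tracemalloc
import numpy as np

def trace_current_bytes(candidate_fn, *args, **kwargs):
    """Hypothetical line-level sampler: record tracemalloc's current-bytes
    counter on every 'line' trace event while the candidate solution runs."""
    samples = []

    def tracer(frame, event, arg):
        if event == "line":
            # get_traced_memory() returns (current, peak) in bytes
            samples.append(tracemalloc.get_traced_memory()[0])
        return tracer

    tracemalloc.start()
    baseline = tracemalloc.get_traced_memory()[0]  # assumed baseline: pre-run counter
    sys.settrace(tracer)
    try:
        candidate_fn(*args, **kwargs)
    finally:
        sys.settrace(None)
        tracemalloc.stop()
    return np.asarray(samples, dtype=float), float(baseline)

def to_monotonic_peak_profile(current_bytes, baseline):
    """Baseline-correct the sampled series and take its cumulative maximum,
    yielding the non-decreasing envelope (MPP) that suppresses transient
    frees and GC dips."""
    corrected = np.maximum(current_bytes - baseline, 0.0)
    return np.maximum.accumulate(corrected)

def to_unit_peak(mpp):
    """Scale an MPP to unit peak so downstream comparison is by shape only."""
    if len(mpp) == 0 or mpp[-1] == 0:
        return mpp
    return mpp / mpp[-1]
```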
Why runtime stability matters. Temporal instability in execution-time memory profiles is not noise; it reflects underlying algorithmic structure (see Figure 3 for an example), choices of data structures, buffering, copy patterns, and so on, even when solutions are functionally correct. By contrast, widely used downstream similarity metrics compare surface forms or static structure: CodeBLEU augments n-gram overlap with syntax and dataflow cues [32], while AST-based measures such as Tree Structured Edit Distance (TSED/TED) quantify structural edits between program trees [37]. These are informative for code similarity, but they do not assess whether multiple correct generations from the same model exhibit consistent behavior at runtime. Yet this is a fundamental aspect of real-world software engineering: consistency drives cost, capacity planning, tail-failure risk, and maintainability churn. In this work, we fill that gap by measuring runtime stability directly via shape-aware, scale-robust comparisons of memory trajectories (DMPD) and aggregating them into a model-level instability score (MIS). We also report how these stability proxies relate to established maintainability indicators (Cyclomatic Complexity [24], Maintainability Index [9,28], and Cognitive Complexity [6]) to ground operational variability in standard SE practice.
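A minimal sketch of this comparison is given below, assuming DMPD is the mean pairwise DTW distance over the unit-peak MPPs of one task's correct generations and MIS is a simple average of DMPD over tasks; the plain quadratic DTW and the function names are illustrative rather than the paper's implementation.

```python
from itertools import combinations
import numpy as np

def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) Dynamic Time Warping with absolute-difference
    local cost; time-elastic, so profiles of different lengths are comparable."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dmpd(unit_peak_mpps):
    """Dynamic Mean Pairwise Distance for one task: mean DTW distance over
    all pairs of unit-peak MPPs from correct generations (assumed definition)."""
    pairs = list(combinations(unit_peak_mpps, 2))
    if not pairs:
        return 0.0
    return float(np.mean([dtw_distance(a, b) for a, b in pairs]))

def model_instability_score(dmpd_per_task):
    """Model Instability Score: aggregate solution-level DMPD values across
    tasks; a simple mean is assumed here."""
    return float(np.mean(dmpd_per_task))
```

In practice the quadratic DTW above could be swapped for an optimized library implementation without changing the metric's intent; the key design choice is comparing shapes of unit-peak envelopes rather than raw byte counts.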
Much like the well-documented security vulnerabilities in generated code, which have made rigorous security scanning a standard gate before acceptance [29,30], operational instability constitutes a parallel risk that hinders rapid adoption. In traditional container orchestration, deployment manifests mandate precise resource specifications (requests and limits) to govern bin-packing efficiency and horizontal autoscaling [40]. Because these operational envelopes are tuned to historical baselines, an LLM-generated patch that introduces unmeasured memory variance can silently invalidate capacity estimates, forcing operators to determine whether the new solution respects existing constraints or requires expensive re-tuning. If we can design methodologies that quantify the operational risk and cost of deploying LLM-generated patches in production systems, it becomes far easier to integrate AI agents into real delivery pipelines. Our work is one step toward that goal, offering a stability-focused signal for comparing correct patches.
We use competitive programming benchmarks, which may differ materially from industrial codebases: they typically lack deep dependency graphs, long-lived architectural constraints, configuration drift, concurrency, and ecosystem-level integration testing. These differences limit the direct claims our work can make about production outcomes. However, they also offer a controlled lens for isolating confounding factors when analyzing stability: competitive settings let us compare multiple correct generations under fixed inputs, making it easier to disentangle algorithmic choice, caching behavior, allocator/GC effects, and other low-level runtime phenomena from the noise of evolving microservices and environment-specific deployment quirks. Importantly, the underlying substrate that shapes allocation dynamics, such as language runtimes, memory allocators, and garbage collection, remains the same class of mechanisms in both settings.