ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries
Despite recent advances in LLMs, the task of code generation is still challenging. To cope, code selection algorithms select the best program from multiple programs generated by an LLM. However, existing algorithms can fail to identify the correct program, either because they can misidentify nonequivalent programs or because they rely on an LLM and assume it always correctly determines the output for every input. We present ExPairT-LLM, an exact learning algorithm for code selection that selects a program by posing to an LLM oracle two new types of queries: pairwise membership and pairwise equivalence. These queries are simpler for LLMs and enable ExPairT-LLM to identify the correct program through a tournament, which is robust to some LLM mistakes. We evaluate ExPairT-LLM on four popular code datasets. Its pass@1 (success rate) outperforms the state-of-the-art code selection algorithm on average by +13.0% and up to +27.1%. It also improves the pass@1 of LLMs performing complex reasoning by +24.0%.
💡 Research Summary
The paper tackles the problem of selecting the correct program from a set of candidates generated by large language models (LLMs). While LLMs have made impressive strides in code generation, the raw outputs often contain bugs, do not satisfy the natural‑language specification, or even fail to compile. Existing code‑selection techniques either rely on LLM‑generated input‑output examples, clustering based on those examples, or neural‑network estimators of correctness. These approaches can fail when the examples are wrong, when they do not differentiate between nonequivalent programs, or when a single LLM mistake propagates to a wrong final choice.
To overcome these limitations, the authors adapt the classic exact‑learning framework (Angluin, 1987) to the code‑selection setting, but replace the traditional membership and equivalence queries, which are impractical for LLMs, with two new query types that LLMs can answer reliably:
- Pairwise Membership Query – Given a task description, a set of inputs I, and two lists of outputs O₁ and O₂ (produced by two candidate programs on I), the oracle (the LLM) is asked which output list better satisfies the task. The answer is simply O₁ or O₂.
- Pairwise Equivalence Query – Given a task and two candidate programs p₁, p₂, the oracle is asked whether the programs are semantically equivalent with respect to the task. If they are not, the oracle returns a differentiating input x such that p₁(x) ≠ p₂(x).
Both queries are “pairwise” in nature, which aligns with recent findings that LLMs excel at judging between two alternatives. The pairwise equivalence query also supplies a concrete counterexample that can be validated by actually executing the programs, thus providing a safety net against oracle mistakes.
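The two query types can be sketched as plain Python functions. This is a minimal illustration, not the authors' implementation: `ask_llm` stands in for a hypothetical LLM oracle and is mocked so the example is self-contained, and the toy programs are illustrative.

```python
# Sketch of the two pairwise query types. `ask_llm` stands in for a
# hypothetical LLM oracle; it is mocked below so the example runs.

def pairwise_membership(task, inputs, outs1, outs2, ask_llm):
    """Ask which output list better satisfies the task; returns 1 or 2."""
    prompt = (f"Task: {task}\nInputs: {inputs}\n"
              f"Which better satisfies the task, O1={outs1} or O2={outs2}?")
    return ask_llm(prompt)

def pairwise_equivalence(task, p1, p2, ask_llm):
    """Ask whether p1 and p2 are semantically equivalent for the task.

    Returns None if the oracle claims equivalence; otherwise returns the
    claimed differentiating input, but only after validating it by
    actually executing both programs (the safety net described above).
    """
    x = ask_llm(f"Task: {task}\nName an input where the two programs "
                f"differ, or answer None if they are equivalent.")
    if x is None:
        return None
    return x if p1(x) != p2(x) else None  # trust only a confirmed counterexample

# Toy candidates for the task "square a number":
square = lambda n: n * n   # correct
double = lambda n: n + n   # buggy, but agrees with square on 0 and 2
diff = pairwise_equivalence("square a number", square, double,
                            lambda prompt: 3)  # mocked oracle answers x=3
```

Note how the equivalence query discards a claimed counterexample on which the two programs actually agree: a hallucinated differentiating input cannot corrupt the input set.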
The proposed algorithm, ExPairT‑LLM (Exact Learning by Pairwise Tournament), proceeds as follows:
- Clustering – For the current set of candidate programs P and the current input set I, each program is executed on I. Programs that produce identical output vectors are grouped into clusters C₁, …, C_k.
- Tournament (Copeland’s method) – For every pair of clusters (C_i, C_j), a pairwise membership query is issued. The winning cluster receives a point; after all pairwise comparisons, the cluster with the highest score becomes the “selected cluster” C*.
- Equivalence Check – Within C*, the algorithm issues pairwise equivalence queries between the first program and every other program. If any query reports non‑equivalence, the returned differentiating input x is executed on both programs to confirm the discrepancy. If the validation succeeds, x is added to the input set I, and the whole process repeats on the enlarged input set, thereby refining the clusters.
- Termination – When the selected cluster contains only mutually equivalent programs, the algorithm returns any program from that cluster.
The algorithm has two built‑in robustness mechanisms: (a) the tournament aggregates many pairwise judgments, so a single erroneous LLM answer has limited impact on the final ranking; (b) differentiating inputs are verified by actual execution, preventing spurious equivalence claims from corrupting the process.
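Mechanism (a) can be illustrated with a toy Copeland scorer (the numbers are illustrative, not from the paper): a wrong judgment in a comparison that does not involve the leading cluster leaves the winner unchanged.

```python
def copeland_scores(prefers, n):
    """Copeland's method: each cluster scores one point per pairwise win.
    `prefers(i, j)` returns True iff cluster i beats cluster j."""
    scores = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            scores[i if prefers(i, j) else j] += 1
    return scores

# Ground truth: lower index is better, so cluster 0 beats everyone.
perfect = copeland_scores(lambda i, j: i < j, 4)    # [3, 2, 1, 0]
# Flip a single judgment between two *losing* clusters (1 vs 2):
one_error = copeland_scores(lambda i, j: (i < j) != ((i, j) == (1, 2)), 4)
# Cluster 0 still holds the unique top score.
```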
Theoretical analysis shows that if the LLM oracle is always correct, ExPairT‑LLM is an exact learner: it will always identify the unique correct program. The number of queries is bounded by O(|P|²) for each query type, which is practical for candidate sets of a few hundred programs. The authors also derive lower bounds on the probability of selecting the correct cluster when the oracle is noisy, demonstrating graceful degradation.
Empirical Evaluation
The authors evaluate ExPairT‑LLM on four widely used code‑generation benchmarks:
- HumanEval (Chen et al., 2021)
- MBPP‑sanitized (Austin et al., 2021)
- APPS (Hendrycks et al., 2021)
- LiveCodeBench (Jain et al., 2025)
They compare against the state‑of‑the‑art code‑selection methods B4 (Chen et al., 2024a) and CODET (Chen et al., 2023a). Results:
- Average pass@1 improvement over B4: +13.0%, with a maximum gain of +27.1% on a single dataset.
- Average pass@1 improvement over CODET: +16.6%.
Furthermore, they test the impact of ExPairT‑LLM on three powerful LLMs that already perform complex reasoning:
- OpenAI o1‑mini – pass@1 increase of +32.8%.
- DeepSeek‑R1 – pass@1 increase of +20.4%.
- Gemini 2.5 Flash – pass@1 increase of +18.9%.
Ablation studies reveal that pairwise membership queries contribute +24.6% to the overall gain, while pairwise equivalence queries add +7.7%, confirming that both components are essential.
Strengths and Contributions
- Novel Query Design – The introduction of pairwise membership and equivalence queries that align with LLM strengths.
- Robust Tournament Mechanism – Use of Copeland’s method to aggregate pairwise judgments, mitigating individual LLM errors.
- Dynamic Input Generation – Leveraging differentiating inputs from equivalence queries to adaptively refine the test set, something most prior methods lack.
- Theoretical Guarantees – Proof of exact learning under a perfect oracle and bounded query complexity.
- Strong Empirical Gains – Consistent, sizable improvements across diverse benchmarks and LLM backbones.
Limitations and Future Work
- The method assumes that the correct program is present in the candidate set; if the generation step fails to produce a correct solution, ExPairT‑LLM cannot recover.
- Query cost (LLM token usage and latency) grows quadratically with the number of candidates, which may be prohibitive for very large candidate pools.
- The approach relies on the LLM’s ability to produce meaningful differentiating inputs; for tasks requiring sophisticated data structures or domain‑specific knowledge, generating such inputs may still be challenging.
- Extending the framework to handle stochastic or non‑terminating programs, perhaps via timeout handling or probabilistic equivalence, is an open direction.
Conclusion
ExPairT‑LLM presents a principled, LLM‑centric solution to the code‑selection problem. By reframing exact learning with pairwise queries that LLMs can answer reliably, and by embedding a tournament‑based aggregation and a verification loop, the authors achieve substantial accuracy gains over prior state‑of‑the‑art methods. The work demonstrates that careful alignment of algorithmic design with LLM capabilities can dramatically improve the reliability of LLM‑generated code, paving the way for more trustworthy automated programming assistants.