Beyond the Prompt: Assessing Domain Knowledge Strategies for High-Dimensional LLM Optimization in Software Engineering
Background/Context: Large Language Models (LLMs) demonstrate strong performance on low-dimensional software engineering optimization tasks (≤11 features) but consistently underperform on high-dimensional problems where Bayesian methods dominate. A fundamental gap exists in understanding how systematic integration of domain knowledge (whether from humans or automated reasoning) can bridge this divide.

Objective/Aim: We compare human versus artificial intelligence strategies for generating domain knowledge, systematically evaluating four distinct architectures to determine whether structured knowledge integration enables LLMs to generate effective warm starts for high-dimensional optimization.

Method: We evaluate four approaches on MOOT datasets stratified by dimensionality: (1) Human-in-the-Loop Domain Knowledge Prompting (H-DKP), utilizing asynchronous expert feedback loops; (2) Adaptive Multi-Stage Prompting (AMP), implementing sequential constraint identification and validation; (3) Dimension-Aware Progressive Refinement (DAPR), conducting optimization in progressively expanding feature subspaces; and (4) Hybrid Knowledge-Model Approach (HKMA), synthesizing statistical scouting (TPE) with RAG-enhanced prompting. Performance is quantified via Chebyshev distance to optimal solutions and ranked using Scott-Knott clustering against an established baseline for LLM-generated warm starts. All human studies conducted as part of this work will comply with the policies of our local Institutional Review Board.
💡 Research Summary
The paper tackles a well‑observed “dimensional barrier” in software‑engineering optimization: large language models (LLMs) can generate effective warm‑start configurations for low‑dimensional problems (≤5 features) but their performance collapses for medium (6‑11) and especially high (>11) dimensional tasks, where Bayesian optimizers such as Gaussian‑process‑based UCB remain superior. The authors hypothesize that the failure stems from the scarcity of high‑dimensional, domain‑specific examples in LLM training corpora, and that systematic injection of domain knowledge could bridge the gap.
To test this, they design four knowledge‑integration architectures that span a spectrum from heavy human involvement to full automation:
- Human‑in‑the‑Loop Domain Knowledge Prompting (H‑DKP) – experts iteratively review and refine constraints, feature relationships, and heuristics; each feedback round updates the prompt.
- Adaptive Multi‑Stage Prompting (AMP) – the LLM itself performs a four‑stage cycle (analysis → constraint identification → generation → validation), thereby generating its own knowledge without human input.
- Dimension‑Aware Progressive Refinement (DAPR) – the problem is first reduced to a statistically important sub‑space; the LLM is guided through progressively expanding dimensions, mitigating the curse of dimensionality.
- Hybrid Knowledge‑Model Approach (HKMA) – combines Retrieval‑Augmented Generation (RAG) to pull recent documentation/code and Tree‑of‑Parzen‑Estimators (TPE) to provide probabilistic priors, merging data‑driven and semantic cues.
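To make the DAPR idea concrete, the loop below is a minimal sketch of progressive subspace refinement: rank features by a crude one-at-a-time sensitivity probe (a stand-in for the statistical importance analysis the method assumes), then search in an expanding subset of the most important dimensions, carrying the best configuration forward at each tier. All function names, the tier schedule, and the random-search inner loop are illustrative assumptions, not the paper's implementation.

```python
import random

def rank_features(evaluate, n_features, n_probes=20):
    """Rank features by a crude sensitivity estimate: perturb one
    feature at a time and measure the average objective change."""
    base = [0.5] * n_features
    base_score = evaluate(base)
    importance = []
    for i in range(n_features):
        deltas = []
        for _ in range(n_probes):
            x = base[:]
            x[i] = random.random()
            deltas.append(abs(evaluate(x) - base_score))
        importance.append(sum(deltas) / n_probes)
    return sorted(range(n_features), key=lambda i: -importance[i])

def dapr_warm_start(evaluate, n_features, tiers=(4, 8, None),
                    samples_per_tier=50):
    """Optimize in progressively expanding subspaces: vary only the
    top-k features per tier, keeping the best configuration found."""
    order = rank_features(evaluate, n_features)
    best = [0.5] * n_features
    best_score = evaluate(best)
    for k in tiers:
        active = order[: (k or n_features)]
        for _ in range(samples_per_tier):
            cand = best[:]
            for i in active:  # only the active subspace is varied
                cand[i] = random.random()
            score = evaluate(cand)
            if score < best_score:  # minimization
                best, best_score = cand, score
    return best

# Toy objective: Chebyshev-style distance to a hidden 12-D optimum.
target = [0.2, 0.8, 0.5, 0.1, 0.9, 0.3, 0.7, 0.4, 0.6, 0.5, 0.2, 0.8]
f = lambda x: max(abs(a - b) for a, b in zip(x, target))
warm = dapr_warm_start(f, 12)
```

Because each tier only accepts improvements, the returned configuration is never worse than the all-defaults starting point, which mirrors the "stable improvements without excessive overhead" behavior reported for DAPR.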
Experiments use the MOOT repository, a curated collection of >120 software‑engineering optimization datasets. The authors stratify datasets into low (<6), medium (6‑11), and high (>11) dimensional tiers, selecting at least ten datasets per tier for statistical power. Each method is run for 20 independent trials per dataset. The primary performance metric is Chebyshev distance to the Pareto‑optimal configuration (after normalizing objectives). Secondary metrics include diversity of generated samples (average pairwise Euclidean distance) and computational cost measured in API token usage. Baselines are Random sampling, GP‑UCB (state‑of‑the‑art Bayesian optimizer), and a standard few‑shot LLM warm‑start (BS_LLM). Statistical significance is assessed via Scott‑Knott clustering and Effect‑Size Difference (ESD) testing.
Key Findings
- HKMA delivers the largest gains on high‑dimensional problems, reducing average Chebyshev distance by ~27% relative to the BS_LLM baseline. The RAG component supplies up‑to‑date domain constraints, while TPE supplies a probabilistic prior that guides the LLM toward promising regions.
- DAPR excels in the medium‑dimensional tier; its progressive expansion of feature space yields stable improvements without excessive computational overhead.
- AMP shows that fully automated multi‑stage reasoning can improve over the baseline but still lags behind methods that incorporate explicit expert knowledge, indicating that LLMs alone struggle to discover high‑quality constraints in complex spaces.
- H‑DKP achieves statistically significant improvements across all tiers, confirming that expert‑provided structural constraints are valuable. However, it incurs the highest token cost and requires 5‑10 feedback rounds per dataset; performance plateaus after roughly seven rounds, suggesting diminishing returns on additional human effort.
- Cost‑effectiveness analysis reveals that HKMA offers the best performance‑per‑token ratio, making it the most practical for real‑world deployment where API costs matter.
- An ablation of knowledge categories shows that structural constraints (e.g., “feature A must be ≤ 2× feature B”) and learned feature correlations contribute the bulk of the performance boost, while heuristic tips (e.g., “prefer lower memory usage”) provide modest additional benefit.
- The study also quantifies the trade‑off between human effort and automation: while human feedback yields higher absolute gains, automated RAG+TPE achieves comparable gains at a fraction of the cost, especially valuable when expert availability is limited.
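The ablation's emphasis on structural constraints suggests a simple mechanical role for them: filtering LLM-generated warm starts before any expensive evaluation. The sketch below shows that filtering step with hypothetical constraints and feature names of the kind the ablation describes (e.g. one feature bounded by twice another); none of these names come from the paper.

```python
# Illustrative expert-style structural constraints: (label, predicate).
CONSTRAINTS = [
    ("cache_mb <= 2 * heap_mb", lambda c: c["cache_mb"] <= 2 * c["heap_mb"]),
    ("threads >= 1",            lambda c: c["threads"] >= 1),
]

def filter_warm_starts(candidates, constraints=CONSTRAINTS):
    """Keep only configurations satisfying every structural constraint;
    return rejected configurations with the constraints they violated."""
    kept, rejected = [], []
    for cand in candidates:
        failed = [label for label, ok in constraints if not ok(cand)]
        if failed:
            rejected.append((cand, failed))
        else:
            kept.append(cand)
    return kept, rejected

good = {"cache_mb": 512,  "heap_mb": 1024, "threads": 4}
bad  = {"cache_mb": 4096, "heap_mb": 1024, "threads": 4}
kept, dropped = filter_warm_starts([good, bad])  # kept == [good]
```

Recording which constraint each rejected candidate violated also yields a cheap feedback signal that could be fed back into the prompt, in the spirit of the H-DKP feedback rounds.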
Implications
The work demonstrates that LLMs, despite their limited intrinsic capacity to model high‑dimensional interactions, can become competitive warm‑start generators when enriched with domain knowledge. Importantly, the hybrid RAG‑TPE approach shows that leveraging external textual resources and lightweight statistical priors can substitute much of the expert labor while preserving performance. This opens a pathway for scalable, cost‑effective warm‑starting of software‑engineering optimizers in settings where labeling budgets are tight and high‑dimensional configuration spaces are common (e.g., cloud service tuning, compiler flag selection, hyper‑parameter optimization for ML pipelines).
Future research directions suggested include: (1) extending the retrieval component to multimodal sources (e.g., code repositories, performance logs), (2) exploring active learning loops where the optimizer queries the LLM for additional constraints, and (3) applying the framework to other domains such as hardware design or cyber‑physical system configuration. Overall, the paper provides a rigorous empirical foundation for integrating human and automated knowledge into LLM‑driven optimization pipelines, offering both methodological insights and practical guidelines for researchers and practitioners.