From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation
Large Language Models (LLMs) show strong abilities in code generation, but their skill in creating efficient parallel programs is less well studied. This paper explores how LLMs generate task-based parallel code from three kinds of input prompts: natural language problem descriptions, sequential reference implementations, and parallel pseudocode. We focus on three programming frameworks: OpenMP tasking, C++ standard parallelism, and the asynchronous many-task runtime HPX. Each framework offers a different level of abstraction and control over task execution. We evaluate LLM-generated solutions for correctness and scalability. Our results reveal both strengths and weaknesses of LLMs with respect to problem complexity and framework. Finally, we discuss what these findings mean for future LLM-assisted development in high-performance and scientific computing.
💡 Research Summary
The paper investigates how well large language models (LLMs) can automatically generate task‑based parallel code, a capability that is crucial for high‑performance scientific computing but remains under‑explored. The authors consider three distinct prompting strategies—(1) a natural‑language problem description only, (2) a full sequential C++ implementation, and (3) a structured parallel pseudo‑code outline—and apply them to three widely used parallel programming models: OpenMP tasking, C++ standard parallel algorithms, and the asynchronous many‑task runtime HPX.
Four benchmark problems are selected to span a range of algorithmic patterns and parallelization challenges: (1) a stochastic approximation of π (embarrassingly parallel), (2) merge sort (recursive task creation with explicit dependencies), (3) dense matrix‑matrix multiplication (compute‑bound with high arithmetic intensity), and (4) the Conjugate Gradient method (iterative solver with reductions and strict ordering). These benchmarks stress different aspects of task granularity, dependency inference, and synchronization.
Three LLMs are evaluated: the proprietary ChatGPT‑5, Google Gemini‑3, and the open‑source Qwen‑Coder 30B. For each combination of benchmark, prompt type, model, and parallel framework the authors generate multiple code snippets and assess them along three axes: correctness (using the pass@k metric), intrinsic code complexity (lines of code, comment density, cyclomatic complexity measured by the scc tool), and scalability (strong and weak scaling up to 128 threads on a dual‑socket AMD EPYC 7742 system). A composite Parallel Code Generation Quality Score (PCGQS) is introduced, averaging functional correctness and a normalized scaling score.
Key findings:
- Model performance – ChatGPT‑5 consistently achieves the highest pass@1 (≈ 85 % overall), followed by Gemini‑3 (≈ 78 %). Qwen‑Coder lags, especially on C++ standard parallelism and OpenMP, but its pass@1 rises to near‑perfect levels when medium‑level human fixes are allowed.
- Framework impact – OpenMP tasking yields the most robust results across all models; C++ standard parallelism performs excellently for ChatGPT and Gemini but poorly for Qwen. HPX is the most challenging, with initial pass@1 below 30 % for every model, reflecting frequent errors in futures handling, namespace usage, and data‑sharing.
- Prompt influence – The type of prompt has only a modest effect on raw correctness. Adding parallel pseudo‑code provides a slight improvement, but natural‑language‑only prompts are not dramatically worse. This suggests current LLMs still struggle to infer parallel structure from minimal cues.
- Code complexity – Gemini‑3 inserts the most comments, enhancing readability, while ChatGPT produces the leanest code with the lowest cyclomatic complexity. Prompts that include sequential code tend to produce more structurally complex output because the model preserves the original control flow and merely wraps parallel constructs around it.
- Scalability – OpenMP and C++ standard parallelism both exhibit near‑linear strong scaling up to 128 threads for the easier benchmarks (π, matrix multiplication) and reasonable scaling for the more synchronization‑heavy ones (merge sort, CG). HPX scaling deteriorates sharply after 32 threads, especially on CG, due to overhead from fine‑grained task management and incorrect future usage.
- Human‑in‑the‑loop – Allowing easy fixes (e.g., missing headers, minor lambda capture issues) lifts pass@1 by 10–15 percentage points. Medium fixes (e.g., correcting namespace mismatches, adding missing .get() calls) bring most model‑benchmark pairs to 100 % correctness, highlighting that many failures are syntactic or API‑level rather than algorithmic.
The authors conclude that while LLMs can generate syntactically correct parallel code, their ability to produce high‑performance, correctly synchronized implementations varies strongly with the abstraction level of the target framework and the intrinsic difficulty of the problem. OpenMP, being directive‑based, aligns well with current model capabilities; higher‑level declarative APIs (C++ standard) are also tractable for the best models, whereas fully asynchronous runtimes like HPX remain out of reach without substantial human intervention.
Future work is suggested in three directions: expanding the evaluation to GPU‑oriented models (CUDA, SYCL, Kokkos), integrating automatic post‑generation debugging and optimization pipelines, and enriching training data with more examples of fine‑grained task‑based code to improve models’ understanding of asynchronous runtimes. The study provides a valuable benchmark suite and a set of metrics (pass@k, scc‑based complexity, PCGQS) that can serve as a foundation for subsequent research on LLM‑assisted HPC software development.