MHRC-Bench: A Multilingual Hardware Repository-Level Code Completion Benchmark

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Large language models (LLMs) have achieved strong performance on code completion tasks in general-purpose programming languages. However, existing repository-level code completion benchmarks focus almost exclusively on software code and largely overlook hardware description languages. In this work, we present **MHRC-Bench**, consisting of **MHRC-Bench-Train** and **MHRC-Bench-Eval**, the first benchmark designed for multilingual hardware code completion at the repository level. Our benchmark targets completion tasks across three major hardware design coding styles. Each completion target is annotated with code-structure-level and hardware-oriented semantic labels derived from concrete syntax tree analysis. We conduct a comprehensive evaluation of models on MHRC-Bench-Eval; the results and analysis demonstrate the effectiveness of MHRC-Bench.


💡 Research Summary

The paper addresses a glaring gap in the evaluation of large language models (LLMs) for code completion: while many benchmarks exist for general‑purpose software languages, hardware description languages (HDLs) have been largely ignored. To fill this void, the authors introduce MHRC‑Bench, a multilingual, repository‑level benchmark specifically designed for hardware code completion. The benchmark comprises two parts: MHRC‑Bench‑Train, a large training corpus, and MHRC‑Bench‑Eval, a carefully curated evaluation set.

Dataset construction
The authors harvested open‑source hardware repositories from GitHub that satisfy several criteria: permissive licenses (e.g., MIT), creation after the year 2000, at least five stars, and exclusion of copyleft or vendor‑specific directories. After de‑duplication, they obtained 584 distinct repositories containing 47,175 source files across four language families: (1) Register‑Transfer Level (RTL) languages – SystemVerilog/Verilog and VHDL, (2) High‑Level Synthesis (HLS) C/C++, (3) Domain‑Specific Language (DSL) – Chisel, and (4) a small set of auxiliary files. Each file contributes exactly one completion target, selected by parsing the file with tree‑sitter into a concrete syntax tree (CST) and randomly picking a syntactically complete node. This ensures that the target spans are always well‑formed code fragments rather than arbitrary character spans.
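The repository-filtering step can be sketched as a simple predicate over repository metadata. This is an illustrative reconstruction of the stated criteria (permissive license, created after 2000, at least five stars); the field names, the exact set of accepted licenses, and the metadata schema are assumptions, not the paper's actual pipeline.

```python
from datetime import datetime

# Assumed set of permissive licenses; the paper only names MIT as an example.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}

def passes_filters(repo: dict) -> bool:
    """Apply the stated harvesting criteria: permissive license,
    creation after the year 2000, and at least five stars."""
    created = datetime.fromisoformat(repo["created_at"])
    return (
        repo["license"].lower() in PERMISSIVE_LICENSES
        and created.year > 2000
        and repo["stars"] >= 5
    )

# Hypothetical metadata records for illustration only.
repos = [
    {"name": "a", "license": "MIT", "created_at": "2015-06-01", "stars": 12},
    {"name": "b", "license": "GPL-3.0", "created_at": "2018-01-01", "stars": 40},
    {"name": "c", "license": "Apache-2.0", "created_at": "1999-05-01", "stars": 9},
]
kept = [r["name"] for r in repos if passes_filters(r)]
print(kept)  # → ['a']
```

Copyleft exclusion and vendor-specific directory pruning would be applied similarly, as further predicates over licenses and file paths.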

Fine‑grained annotation
MHRC‑Bench‑Eval provides two orthogonal annotation layers:

  1. Code‑structure level – The depth of the selected CST node is recorded, and nodes are bucketed into five depth‑based categories, reflecting increasing structural complexity (e.g., top‑level module declarations versus inner expression statements).

  2. Hardware‑semantic level – Nine semantic categories are defined, derived from a detailed analysis of hardware design and verification workflows. The categories include Design Structure, Declaration & Definition, Storage Block, Computation Block, Control Flow Block, Interface, Property & Assertion, Testbench Stimulus, and Monitoring & Checking. Five hardware experts validated the labeling scheme.

These annotations enable researchers to dissect model performance not only by raw token accuracy but also by structural depth and domain‑specific intent.
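The two annotation layers can be illustrated with a small sketch. The nine semantic labels below are the ones listed in the paper; the depth-bucket boundaries and the `level-N` category names are assumptions, since the paper only states that five depth-based buckets are used.

```python
# The nine hardware-semantic categories named in the paper.
SEMANTIC_LABELS = {
    "Design Structure", "Declaration & Definition", "Storage Block",
    "Computation Block", "Control Flow Block", "Interface",
    "Property & Assertion", "Testbench Stimulus", "Monitoring & Checking",
}

def depth_bucket(depth: int) -> str:
    """Map a CST node depth to one of five coarse structural categories.
    The cut points here are hypothetical."""
    bounds = [2, 4, 6, 8]  # assumed bucket boundaries
    for i, b in enumerate(bounds):
        if depth <= b:
            return f"level-{i + 1}"
    return "level-5"

def annotate(depth: int, semantic: str) -> dict:
    """Combine both orthogonal annotation layers for one completion target."""
    assert semantic in SEMANTIC_LABELS, f"unknown label: {semantic}"
    return {"structure": depth_bucket(depth), "semantic": semantic}

print(annotate(3, "Interface"))  # → {'structure': 'level-2', 'semantic': 'Interface'}
```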

Experimental setup
The authors evaluate a spectrum of models: open‑source code LLMs (DeepSeek‑Coder‑6.7B, Qwen2.5‑Coder‑3B/7B/14B) and commercial offerings (GPT‑5, Gemini 2.5 Pro, Grok‑4, DeepSeek V3.2). They report three metrics:

  • Exact Match (EM) – token‑level agreement with the ground‑truth target.
  • Exact Structure (ES) – agreement on both tokens and the associated structural/semantic label.
  • Compilation Pass Rate – whether the generated HDL code successfully compiles or synthesizes using standard toolchains.
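The Exact Match metric above can be sketched as follows. The whitespace tokenization used here is an assumption for illustration; the paper's actual tokenizer may differ.

```python
def exact_match(prediction: str, target: str) -> bool:
    """Token-level exact match between a completion and its ground truth.
    Whitespace tokenization is assumed here for simplicity."""
    return prediction.split() == target.split()

def em_score(preds: list[str], targets: list[str]) -> float:
    """Fraction of completions that exactly match their ground-truth target."""
    hits = sum(exact_match(p, t) for p, t in zip(preds, targets))
    return hits / len(targets)

# Hypothetical Verilog-style completions for illustration.
preds = ["assign y = a & b;", "always @(posedge clk)"]
targets = ["assign y = a & b;", "always @(posedge clk) begin"]
print(em_score(preds, targets))  # → 0.5
```

The Compilation Pass Rate metric would additionally require running each completed file through an HDL toolchain, which is outside the scope of this sketch.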

Models are first evaluated in their vanilla, pre‑trained state, then fine‑tuned on MHRC‑Bench‑Train for one to two epochs.

Key findings

  • Pre‑trained models perform poorly on hardware code: EM scores hover near 0 % and ES is similarly low, confirming that generic code training does not transfer to HDL domains.
  • Fine‑tuning on MHRC‑Bench‑Train yields dramatic improvements. The Qwen2.5‑Coder‑7B model, after fine‑tuning, achieves EM and ES improvements of roughly 30 % absolute and doubles the compilation pass rate.
  • Remarkably, the fine‑tuned 7 B model outperforms the larger 14 B pre‑trained model across all three metrics, demonstrating that domain‑specific data can outweigh raw model size.
  • Language‑specific analysis shows that RTL languages (Verilog/SystemVerilog and VHDL) are hardest for models, especially in the Design Structure and Interface categories, likely due to complex port declarations and hierarchical module instantiation. HLS C/C++ exhibits higher accuracy in Computation Block and Property categories, reflecting the models’ prior exposure to C‑style syntax.
  • Tasks with higher cross‑file dependency counts (average 3–6 per target) see a steep performance drop, highlighting a current limitation: LLMs struggle to retrieve and integrate information spread across multiple repository files.

Contributions

  1. Benchmark creation – The first large‑scale, multilingual repository‑level hardware code completion benchmark with fine‑grained structural and semantic annotations.
  2. Dataset release – Publicly available training and evaluation splits, ensuring no overlap between them and providing realistic cross‑file contexts.
  3. Empirical insights – Comprehensive evaluation of both open‑source and commercial models, revealing that modestly sized models can surpass larger ones when fine‑tuned on domain‑specific data, and exposing specific weakness areas (e.g., interface handling, cross‑file reasoning).

Implications and future work
MHRC‑Bench establishes a solid foundation for systematic research on LLM‑assisted hardware design. The authors suggest extending the benchmark to cover additional verification artifacts (e.g., formal property languages), integrating retrieval‑augmented generation to better handle cross‑file dependencies, and exploring end‑to‑end pipelines that include testbench generation, simulation, and automated debugging.

In summary, the paper delivers a well‑engineered dataset, a rigorous evaluation protocol, and compelling evidence that targeted fine‑tuning dramatically improves LLM performance on hardware code completion, thereby opening a new research frontier at the intersection of AI and digital hardware design.

