Programming Language Confusion: When Code LLMs Can't Keep their Languages Straight

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) have achieved state-of-the-art performance across software engineering tasks, from code generation to translation. However, we identify and systematically evaluate a critical failure mode: Programming Language Confusion (PLC) – the generation of code in unintended languages despite explicit instructions. Through evaluation of 10 popular LLMs across six multilingual datasets (LiveCodeBench, BabelCode variants, HumanEval-XL, and McEval), we demonstrate that PLC is pervasive, with some specialized models exhibiting the highest confusion rates. Our analysis reveals that PLC is not random noise but reflects systematic patterns: models consistently generate syntactically valid code even when it deviates from language specifications. This behavior produces distinct language migration patterns, most notably a strong default to Python and systematic shifts between syntactically similar language pairs (e.g., C#/Java). These migrations reflect statistical preferences learned from training data rather than goal-directed reasoning. We demonstrate that explicit language keywords provide the most effective mitigation, while natural language instructions have limited influence on model behavior. Furthermore, model quantization – though essential for practical deployment – significantly amplifies PLC and degrades syntactic stability in complex tasks. Our findings underscore that language fidelity should be treated as a core evaluation dimension for code LLMs. We advocate for standardized benchmarks and prompt formats with explicit language constraints to enable more reliable assessment and foster the development of robust, multilingual code generation systems.


💡 Research Summary

The paper “Programming Language Confusion: When Code LLMs Can’t Keep Their Languages Straight” investigates a previously under‑explored failure mode of large language models (LLMs) used for code generation and translation: Programming Language Confusion (PLC). PLC occurs when a model produces code in a language different from the one explicitly requested by the user, even though the prompt clearly specifies the target language.

Research Scope and Methodology
The authors evaluate ten popular LLMs: general-purpose models (GPT-3.5-Turbo, GPT-4.1-Mini, DeepSeek-V2-Lite-Instruct-16B, Llama-3.1-8B-Instruct, Mistral-7B-Instruct, Qwen2.5-Instruct-14B) and code-specialized models (CodeLlama-13B-Instruct, DeepSeek-Coder-V2-Lite-Instruct-16B, Qwen2.5-Coder-Instruct-14B, Starcoder2-15B-Instruct). The models span 7B to 16B parameters and include both full-precision and quantized (4-bit/8-bit) variants.

Six multilingual benchmark suites are used, covering both code generation and code translation tasks: LiveCodeBench, BabelCode-HumanEval, BabelCode-MBPP, BabelCode-TP3, HumanEval-XL, and McEval. In total, roughly 75,000 program samples across 16 programming languages and up to 23 natural languages are examined.

To quantify PLC, three metrics are introduced:

  1. Language Confusion Pass Rate (LCPR) – the proportion of generated snippets that match the target language.
  2. Code Parsing Pass Rate (CPPR) – the proportion of snippets that parse without syntax errors in the detected language.
  3. Dominant Migration Rate (DMR) – for confused samples, the share that migrate to a particular “default” language (most often Python).
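
To make the three definitions concrete, here is a minimal sketch of how they could be computed over a batch of generations. The record keys (`target`, `detected`, `parses`) are illustrative; the paper does not publish this interface.

```python
from collections import Counter

def plc_metrics(samples):
    """Compute LCPR, CPPR, and DMR for a list of generation records.

    Each record is a dict with hypothetical keys:
      'target'   - the requested programming language
      'detected' - the language identified by a detector
      'parses'   - whether the snippet parses in the detected language
    """
    n = len(samples)
    matched = [s for s in samples if s["detected"] == s["target"]]
    lcpr = len(matched) / n                      # language-match rate
    cppr = sum(s["parses"] for s in samples) / n  # syntactic-validity rate
    confused = [s for s in samples if s["detected"] != s["target"]]
    if confused:
        # DMR: share of confused outputs that migrate to the single most
        # common unintended language (most often Python, per the paper).
        top_lang, top_count = Counter(s["detected"] for s in confused).most_common(1)[0]
        dmr = top_count / len(confused)
    else:
        top_lang, dmr = None, 0.0
    return lcpr, cppr, dmr, top_lang

samples = [
    {"target": "java", "detected": "java",   "parses": True},
    {"target": "java", "detected": "python", "parses": True},
    {"target": "go",   "detected": "python", "parses": False},
    {"target": "go",   "detected": "go",     "parses": True},
]
print(plc_metrics(samples))  # → (0.5, 0.75, 1.0, 'python')
```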

Language detection is performed by an ensemble of four tools (Highlight.js, Guesslang, Philomath‑1209, PLangRec) to ensure robust identification. All experiments use greedy decoding (temperature = 0) to minimize stochastic variation.
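
A generic majority-vote wrapper over several detectors might look like the sketch below. The four tools named above expose different APIs, so the callables here are toy stand-ins, and the strict-majority policy is an assumption rather than the paper's exact aggregation rule.

```python
from collections import Counter

def ensemble_detect(snippet, detectors):
    """Majority-vote language identification over several detectors.

    `detectors` is a list of callables mapping snippet -> language name.
    """
    votes = Counter(d(snippet) for d in detectors)
    language, count = votes.most_common(1)[0]
    # Require a strict majority; otherwise flag the sample as ambiguous.
    return language if count > len(detectors) // 2 else "ambiguous"

# Toy detectors standing in for the real tools:
detectors = [
    lambda s: "python" if "def " in s else "unknown",
    lambda s: "python" if ":" in s else "unknown",
    lambda s: "java" if "public " in s else "python",
]
print(ensemble_detect("def f(x): return x", detectors))  # → 'python'
```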

Key Findings

  • Pervasiveness of PLC – Across all models, LCPR rarely exceeds 70 %; many models fall below 60 %. Code‑specialized models such as CodeLlama‑13B and DeepSeek‑Coder‑V2 exhibit especially low LCPR, with a strong bias toward generating Python code even when another language is requested.

  • Systematic Migration Patterns – The DMR analysis shows that roughly 40‑45 % of confused outputs default to Python. Additionally, language pairs with high syntactic similarity (e.g., C# ↔ Java, C ↔ C++) frequently exchange code, indicating that the models rely on learned statistical preferences rather than a true understanding of language boundaries.

  • Syntactic Validity Despite Confusion – Confused snippets are often syntactically correct; CPPR remains above 80 % for most models. This means that PLC can silently produce runnable code in the wrong language, potentially leading to subtle bugs or security issues that are hard to detect.

  • Prompt Engineering Impact – Explicit language keywords (“Write the solution in Java”) improve LCPR by an average of 20 percentage points, whereas natural‑language instructions alone have minimal effect. This suggests that models treat language tags as strong conditioning signals.

  • Quantization Amplifies PLC – Quantized versions of the models show a consistent drop in LCPR (≈ 15 pp) and CPPR (≈ 10 pp) compared with their full‑precision counterparts. The authors attribute this to the compression of probability distributions, which magnifies the model’s bias toward its most “comfortable” language.
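
As a rough illustration of the prompt-engineering finding, a prompt builder contrasting the two styles might look like the following. The exact wording and the `explicit` flag are hypothetical; the paper's prompt templates are not reproduced here.

```python
def build_prompt(task, language, explicit=True):
    """Build a code-generation prompt with or without an explicit language tag."""
    if explicit:
        # Explicit language keyword plus a tagged code fence: the style the
        # paper reports as the most effective PLC mitigation.
        return f"Write the solution in {language}.\n\n{task}\n\n```{language}\n"
    # Natural-language request only: the paper finds this has limited effect.
    return (f"Solve the following task. The answer should be written in the "
            f"{language} programming language.\n\n{task}")

print(build_prompt("Reverse a string.", "java"))
```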

Implications and Recommendations

  1. Treat Language Fidelity as a Core Evaluation Dimension – Existing code benchmarks focus on functional correctness but overlook whether the generated code respects the requested language. Adding LCPR (or similar) to standard evaluation suites would surface PLC early in model development.

  2. Standardize Prompt Formats – Incorporating explicit language tags and language‑specific keywords should become a best practice for both research and production usage.

  3. Architectural Adjustments – Future code LLMs could benefit from language‑segregated token embeddings, language‑aware attention masks, or auxiliary loss terms that penalize cross‑language leakage during multi‑language pre‑training.

  4. Quantization‑Aware Deployment – When deploying quantized models, developers should run a PLC audit and possibly add a post‑generation language verification step to filter out mis‑tagged code.
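
The post-generation verification step in recommendation 4 could be as simple as the guard below, where `detect` is any snippet-to-language classifier. The function name and the reject-on-mismatch policy are illustrative assumptions, not taken from the paper.

```python
def verify_language(snippet, expected, detect):
    """Reject a generation whose detected language differs from the request."""
    found = detect(snippet)
    if found != expected:
        raise ValueError(f"language mismatch: expected {expected}, detected {found}")
    return snippet

# Toy detector standing in for a real language-ID tool:
toy_detect = lambda s: "python" if "def " in s else "java"

print(verify_language("def f(): pass", "python", toy_detect))  # accepted
try:
    verify_language("def f(): pass", "java", toy_detect)
except ValueError as e:
    print(e)  # mismatch is caught and the sample can be filtered out
```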

Conclusion
Programming Language Confusion is not a random glitch but a systematic, data‑driven phenomenon that manifests as a strong default to Python and as migrations between syntactically similar languages. Explicit language cues mitigate the issue, while quantization aggravates it. The authors argue that reliable multilingual code generation demands that language fidelity be measured, reported, and actively improved in both model training and prompt design.

