Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?
Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective in many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, which frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms across multiple levels. Beyond evaluation, SoLT also provides a general method to enrich any dataset with linguistic diversity while preserving both meaning and logic. To further enhance the stability of LLM-based reasoning, we propose MenTaL, which explicitly guides models to build a concept-symbol mapping table during translation. By linking equivalent expressions to shared symbols, MenTaL maintains consistency and mitigates symbol drift. Experiments on SoLT demonstrate that LLMs indeed suffer from inconsistent symbol mapping under linguistic variation, leading to significant drops in reasoning accuracy. Meanwhile, applying MenTaL brings clear and stable performance improvements across diverse inputs. Overall, our findings reveal that overlooking linguistic diversity hides key weaknesses in LLM-based translators, and our work offers a step toward more reliable logical reasoning in varied real-world scenarios. Our code is available at https://github.com/wufeiwuwoshihua/LinguDiver.
💡 Research Summary
This paper investigates a critical weakness in the emerging paradigm of using large language models (LLMs) as translators that convert natural‑language reasoning problems into formal logic, which is then solved by symbolic solvers. The authors identify “symbol drift” – the phenomenon where semantically equivalent expressions are mapped to different logical symbols – as a major source of failure. Existing logical reasoning benchmarks largely consist of templated, uniform language, causing repeated concepts to be expressed with identical surface forms. Consequently, they do not test whether a model can maintain a consistent concept‑to‑symbol mapping when faced with realistic linguistic variation.
To expose this gap, the authors first conduct a controlled experiment with GPT‑4 as the translator and a suite of symbolic solvers. They introduce four types of mild, meaning‑preserving perturbations—third‑person reference, synonym substitution, part‑of‑speech shift, and syntactic transformation—into six established reasoning datasets (e.g., FOLIO, ProntoQA, ProverQA). Across all perturbation types, translation accuracy drops by 0.11–0.33 points, with third‑person reference and synonym substitution causing the most severe degradation. Error analysis reveals that the drops are almost entirely due to inconsistent symbol assignments, confirming that even subtle linguistic changes can break the downstream reasoning chain.
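The failure mode described above can be made concrete with a tiny sketch. The function below is a hypothetical stand-in for a translator that derives predicate names directly from surface wording, so two meaning-preserving paraphrases of the same concept (the example phrases are illustrative, not from the paper) end up as unrelated logical symbols:

```python
import re

# Hypothetical illustration of "symbol drift": a naive translator that
# builds a predicate name from the surface form, so meaning-preserving
# paraphrases of one concept become two unrelated symbols.
def naive_symbolize(phrase: str) -> str:
    # "is a doctor" -> "IsADoctor", "practices medicine" -> "PracticesMedicine"
    return "".join(w.capitalize() for w in re.findall(r"[a-z]+", phrase.lower()))

original = "is a doctor"
paraphrase = "practices medicine"  # same concept, different wording

s1, s2 = naive_symbolize(original), naive_symbolize(paraphrase)
print(s1, s2)    # IsADoctor PracticesMedicine
print(s1 == s2)  # False: a downstream solver sees two independent predicates
```

Once the two mentions diverge like this, any inference rule that should chain through the shared concept silently fails, which is exactly the error pattern the authors report.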
In response, the paper makes two major contributions. First, it introduces SoLT (Stable Logic Translation), a benchmark that systematically rewrites existing reasoning problems into linguistically diverse yet logically invariant variants. The SoLT pipeline operates in three stages: (1) identification of repeated concepts within a problem, (2) generation of multiple diversified alternatives at word, phrase, and syntactic levels, and (3) semantic filtering to retain only those rewrites that preserve the original meaning and logical structure. This process yields a dataset with 3–5× higher lexical and syntactic diversity while keeping the ground‑truth logical formulas unchanged, thereby providing a rigorous testbed for translation stability.
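The three SoLT stages can be sketched as follows. This is a minimal toy version under stated assumptions: the hand-written `SYNONYMS` table stands in for the paper's LLM-based rewriter, and the semantic filter is a crude placeholder for their meaning-preservation check; none of the function names come from the authors' code.

```python
import re
from collections import Counter

SYNONYMS = {  # stage-2 stand-in for an LLM-based diversifier
    "doctor": ["physician", "medical practitioner"],
}

def repeated_concepts(sentences, min_count=2):
    """Stage 1: find content words that recur across sentences."""
    counts = Counter(w for s in sentences
                     for w in set(re.findall(r"[a-z]+", s.lower())))
    return {w for w, c in counts.items() if c >= min_count and len(w) > 3}

def diversify(sentence, concepts):
    """Stage 2: rewrite each repeated concept at the lexical level."""
    lowered = sentence.lower()
    return [lowered.replace(c, alt)
            for c in concepts if c in lowered
            for alt in SYNONYMS.get(c, [])]

def preserves_logic(original, variant):
    """Stage 3: semantic filter (placeholder for a much stronger check)."""
    return len(variant.split()) >= len(original.split())

sentences = ["Alice is a doctor.", "Every doctor owns a stethoscope."]
concepts = repeated_concepts(sentences)
kept = [v for v in diversify(sentences[0], concepts)
        if preserves_logic(sentences[0], v)]
print(kept)
```

The key property preserved by the real pipeline, and mimicked here, is that only the surface text changes: the ground-truth logical formulas attached to each problem stay untouched.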
Second, the authors propose MenTaL (Mental Representation Table‑guided Logic), a lightweight framework that forces the model to construct a global concept‑symbol mapping table before emitting logical forms. MenTaL works by prompting the LLM to first list all repeated or synonymous concepts, assign each a unique logical symbol, and then use this table during the actual translation step. The approach is compatible with both prompt‑only (closed‑source) settings and fine‑tuning (open‑source) regimes, making it broadly applicable. By referencing a shared table, the model maintains consistent symbol usage across varied expressions, directly mitigating symbol drift.
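A MenTaL-style mapping table can be sketched in a few lines. In the paper the LLM itself produces the concept groupings in a first prompting step; here the `link` calls hand-write that grouping, and the class name and API are illustrative assumptions:

```python
# Minimal sketch of a concept-symbol mapping table in the spirit of MenTaL.
# The synonym grouping is hand-written here; the paper has the LLM list
# repeated or synonymous concepts itself before translating.
class SymbolTable:
    def __init__(self):
        self._canon = {}    # surface form -> canonical concept
        self._symbols = {}  # canonical concept -> logical symbol

    def link(self, *surface_forms):
        """Declare a set of expressions to be the same concept."""
        canon = surface_forms[0]
        for form in surface_forms:
            self._canon[form] = canon
        self._symbols.setdefault(canon, f"P{len(self._symbols)}")

    def symbol(self, surface_form):
        """Every linked expression resolves to one shared symbol."""
        return self._symbols[self._canon[surface_form]]

table = SymbolTable()
table.link("is a doctor", "practices medicine", "works as a physician")
print(table.symbol("is a doctor"))         # P0
print(table.symbol("practices medicine"))  # P0 -- same symbol, no drift
```

Building the table before any logical form is emitted is the design point: the translation step then only looks symbols up, rather than re-deciding them per sentence.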
Extensive experiments on the SoLT benchmark evaluate several state‑of‑the‑art LLMs (GPT‑4, Claude‑2, Llama‑2) with and without MenTaL. Without MenTaL, all models suffer substantial accuracy losses (average 18–30% drop) on the diversified inputs. With MenTaL, the same models recover a large portion of the lost performance, reducing the drop to 5–10% and achieving statistically significant improvements on the most complex tasks. A dedicated “symbol consistency score” further confirms that MenTaL dramatically increases the proportion of globally consistent symbol assignments.
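The summary does not spell out how the symbol consistency score is computed; one plausible formulation, shown below as an assumption rather than the paper's metric, is the fraction of concepts whose mentions all received the same symbol:

```python
# Hypothetical formulation of a "symbol consistency score": the fraction
# of concepts whose mentions were all assigned one and the same symbol.
def symbol_consistency(assignments):
    """assignments: concept -> list of symbols assigned to its mentions."""
    if not assignments:
        return 1.0
    consistent = sum(1 for syms in assignments.values() if len(set(syms)) == 1)
    return consistent / len(assignments)

score = symbol_consistency({
    "doctor": ["Doctor", "Physician"],  # drifted across mentions
    "cat":    ["Cat", "Cat"],           # stable
})
print(score)  # 0.5
```

Under a metric of this shape, MenTaL's table lookup pushes every concept into the consistent case by construction, which matches the reported jump in globally consistent assignments.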
The paper concludes that (1) linguistic diversity is an essential factor for evaluating the robustness of LLM‑based logical translators, (2) SoLT provides a scalable, logic‑preserving diversification method that can be applied to any existing reasoning dataset, and (3) MenTaL offers a practical, model‑agnostic solution to the symbol drift problem. The authors suggest future directions such as extending diversification to multilingual and culturally specific variations, learning to automatically expand and refine the concept‑symbol table via meta‑learning, and tighter integration with symbolic solvers to improve overall pipeline efficiency.