LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
💡 Research Summary
The paper introduces LOGICAL‑COMMONSENSEQA, a new benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using three plausibility‑level operators: AND, OR, and NEITHER/NOR. Existing commonsense QA datasets such as CommonsenseQA present a single‑answer multiple‑choice format, which masks the fact that many real‑world questions admit several plausible answers that may be jointly possible, mutually exclusive, or jointly impossible. By converting each original question into a set of options that are logical combinations of two independently generated atomic answers, the authors create a controlled environment for testing compositional commonsense reasoning while preserving the familiar MCQ format.
Dataset construction proceeds in three stages. First, the authors sample 5,000 questions from the original CommonsenseQA and use GPT‑4o‑mini to over‑generate 4‑6 candidate atomic answers per question, explicitly prompting for both plausible and implausible alternatives. Second, a refinement step filters out factually incorrect, trivially solvable, or semantically incoherent candidates, yielding three high‑quality correct options and four distractors per question. Human validation follows an “awareness‑consensus” protocol: two annotators judge each atomic answer for personal plausibility and perceived social agreement, with disagreements adjudicated. Inter‑annotator agreement (Cohen’s κ = 0.49) reflects the inherent subjectivity of commonsense judgments. Third, a deterministic script pairs atomic answers and assigns one of the three operators, producing 19,996 final instances (4,999 per operator). An additional “MIXED” condition randomly mixes operators across answer choices to prevent models from exploiting operator‑specific patterns.
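The pairing-and-labeling step can be illustrated with a minimal sketch. The function names, exact operator semantics, and data layout below are assumptions for illustration, not the authors' released script: given atomic answers already labeled plausible or implausible, each pair is assigned the operator under which its combination is the correct judgment.

```python
from itertools import combinations

# Hypothetical sketch of the deterministic pairing step described above.
# Semantics assumed: AND when both statements are plausible, OR when
# exactly one is, NEITHER/NOR when both are implausible.
def assign_operator(a_plausible: bool, b_plausible: bool) -> str:
    """Map the plausibility of two atomic statements to the operator
    under which their combination is the correct answer."""
    if a_plausible and b_plausible:
        return "AND"          # both statements jointly plausible
    if a_plausible or b_plausible:
        return "OR"           # exactly one statement plausible
    return "NEITHER/NOR"      # both statements jointly implausible

def build_instances(correct: list, distractors: list) -> list:
    """Pair atomic answers (3 correct + 4 distractors per question, as in
    the paper) and label each pair with its operator."""
    atoms = [(a, True) for a in correct] + [(d, False) for d in distractors]
    return [
        {"pair": (a, b), "operator": assign_operator(pa, pb)}
        for (a, pa), (b, pb) in combinations(atoms, 2)
    ]

pairs = build_instances(
    ["piano", "guitar", "violin"],             # plausible atoms
    ["toaster", "cloud", "sofa", "idea"],      # implausible atoms
)
```

With 7 atoms this yields 21 labeled pairs per question; the actual benchmark presumably subsamples and balances these to reach 4,999 instances per operator.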
The benchmark is split into 11,996 training, 6,000 development, and 2,000 test items, stratified to keep the operator distribution even. Evaluation uses macro‑F1 and accuracy on both a human‑validated (HV) and a non‑validated (NV) half of the test set.
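Macro‑F1 averages per-class F1 with equal weight, so a total collapse on one operator (as reported for NEITHER/NOR below) drags the overall score down even if that class is rare. A minimal self-contained computation, for readers unfamiliar with the metric:

```python
from collections import defaultdict

def macro_f1(gold: list, pred: list) -> float:
    """Macro-F1: unweighted mean of per-class F1 scores, so each
    operator class counts equally regardless of its frequency."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted p, but p was wrong
            fn[g] += 1   # gold class g was missed
    scores = []
    for c in set(gold) | set(pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy example: one NOR item misclassified as AND.
gold = ["AND", "OR", "NOR", "AND"]
pred = ["AND", "OR", "AND", "AND"]
```

Here accuracy is 0.75, but macro‑F1 is 0.6, because the NOR class scores 0 and counts as much as the other two.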
Experiments cover a broad spectrum of models: decoder‑only LLMs (LLaMA‑3.3‑70B, LLaMA‑3.1‑8B, Qwen2.5‑7B) evaluated under zero‑shot and three‑shot prompting, and encoder‑decoder models (Flan‑T5‑base, DeBERTa‑v3‑base) fine‑tuned on the training data. Results reveal consistent patterns:
- Conjunctive (AND) reasoning – LLMs achieve 70‑86 % F1, indicating they can assess the independent plausibility of both statements reasonably well.
- Disjunctive (OR) reasoning – Performance drops modestly to 60‑78 % F1, showing models can detect that at least one statement is plausible but with less confidence.
- Negation (NEITHER/NOR) reasoning – Scores collapse to 6‑14 % F1 across all prompting regimes, exposing a severe inability to recognize that both statements are jointly implausible.
- MIXED condition – F1 falls further to 40‑55 %, confirming that models rely heavily on surface cues tied to a single operator and struggle when they must infer the operator per option.
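The MIXED condition can be sketched as a prompt builder in which each answer choice carries its own operator, removing the per-question operator cue the fixed conditions provide. The wording, option labels, and question text below are illustrative assumptions, not the paper's exact template:

```python
import random

# Illustrative sketch of a MIXED-condition prompt: the operator varies
# per option, so a model cannot key on a single operator per question.
OPERATORS = ["AND", "OR", "NEITHER/NOR"]

def format_mixed_question(question: str, pairs: list,
                          rng: random.Random) -> str:
    lines = [f"Question: {question}"]
    for i, (a, b) in enumerate(pairs):
        op = rng.choice(OPERATORS)  # operator drawn independently per option
        if op == "NEITHER/NOR":
            lines.append(f"{chr(65 + i)}. Neither '{a}' nor '{b}'")
        else:
            lines.append(f"{chr(65 + i)}. '{a}' {op} '{b}'")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

prompt = format_mixed_question(
    "Where would you likely find a piano?",
    [("a concert hall", "a living room"), ("a refrigerator", "a puddle")],
    random.Random(0),  # seeded for reproducibility
)
```

Under this setup a model must infer and evaluate the operator for every option separately, which is consistent with the reported drop to 40‑55 % F1.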
Fine‑tuned models dramatically improve across the board, reaching 83‑93 % F1 for all operators, which suggests that the task is learnable given sufficient supervised signal. Error analysis attributes the NOR failure primarily to the models’ lack of explicit negation handling; many errors involve overlooking that both statements violate commonsense expectations. Additional confusion stems from semantically overlapping options that blur the distinction between “both plausible” and “one plausible”.
The authors argue that traditional single‑answer benchmarks overestimate LLM commonsense abilities because they never test relational plausibility judgments. LOGICAL‑COMMONSENSEQA therefore provides a diagnostic tool that isolates compositional reasoning, highlights a specific weakness in handling negation, and offers a platform for future work on operator‑aware prompting, meta‑learning of logical composition, and expanded operator sets (e.g., implication, exclusivity, temporal causality).
Limitations acknowledged include the restricted operator set (no implication or temporal logic), the focus on MCQ rather than generative settings, and the limited model diversity (decoder‑only models only evaluated via prompting). Ethical considerations note that the data are derived from publicly available sources, contain no personal or sensitive information, and were curated with human oversight to mitigate bias.
In summary, LOGICAL‑COMMONSENSEQA reframes commonsense QA as a test of logical composition over plausibility, reveals that state‑of‑the‑art LLMs handle conjunction and disjunction reasonably but fail dramatically on negation, and sets the stage for research aimed at endowing models with robust, operator‑grounded commonsense reasoning.