INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic
We introduce INDUCTION, a benchmark for finite-structure concept synthesis in first-order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first-order formula that explains the target uniformly across worlds, with correctness verified via exact model checking. The benchmark comprises three regimes: FullObs, CI (contrastive), and EC (existential completion). It also penalizes formula bloat. We find sharp difficulty gradients and persistent hard structural families, and observe that low-bloat formulas generalize far better on held-out worlds. Recent frontier models show qualitatively different behavior across tasks and performance metrics, hinting at different strategies of concept generalization.
💡 Research Summary
The paper introduces INDUCTION, a new benchmark suite for finite‑structure concept synthesis in first‑order logic (FOL). The core task is: given a collection of small finite relational worlds, each annotated with an extensional target predicate T(x), a model must output a single FOL formula φ(x) that exactly captures T in every world. Because each world is finite, correctness can be mechanically verified by exhaustive model checking or SMT solving, eliminating any reliance on natural‑language interpretation.
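Because the worlds are finite, "exact capture" reduces to checking extension equality by brute force. The sketch below illustrates that semantics; the world encoding (a dict of predicate extensions) and the way φ is written as a Python predicate are illustrative assumptions, not the paper's actual representation.

```python
# A tiny finite world: a domain plus extensions for unary P and binary R.
# The predicate names (P, R, T) follow the paper's vocabulary; this
# particular encoding of worlds and formulas is an assumption.
world = {
    "domain": [0, 1, 2, 3],
    "P": {1, 2},
    "R": {(0, 1), (1, 2), (2, 2)},
}
target = {0, 1, 2}  # the extensional labels T(x)

# Candidate phi(x) = exists y. R(x, y) and P(y), written as a Python predicate.
def phi(w, x):
    return any((x, y) in w["R"] and y in w["P"] for y in w["domain"])

def extension(w, formula):
    """Exhaustive model checking: the set of x where the formula holds."""
    return {x for x in w["domain"] if formula(w, x)}

def exact_match(w, formula, t):
    """Correctness is extension *equality*, not mere implication."""
    return extension(w, formula) == t

print(exact_match(world, phi, target))
```

The key point is that verification is mechanical: no natural-language judging is involved, only set equality over a finite domain.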
Three complementary regimes are defined. FullObs (Full Observation) provides complete interpretations of all predicates (P, Q, R, S) in each world; φ must match T in all worlds. CI (Contrastive Induction) splits worlds into YES and NO groups. φ must be an exact match on every YES world while failing to be an exact match on any NO world, thereby forcing discriminative hypotheses that use negative evidence. EC (Existential Completion) hides a fraction of ground atoms; a formula is considered valid if, for each world, there exists some completion of the unknown atoms that makes φ match the observed target extension. This regime tests reasoning under partial information.
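The CI acceptance criterion can be stated compactly: φ must match the target extension on every YES world and mismatch it on every NO world. A minimal sketch of that check, under the same hypothetical world encoding as above (each world carries its own `"T"` labels):

```python
def extension(w, formula):
    return {x for x in w["domain"] if formula(w, x)}

def ci_valid(formula, yes_worlds, no_worlds):
    """CI semantics: exact match on every YES world, non-match on every NO world."""
    return (all(extension(w, formula) == w["T"] for w in yes_worlds)
            and all(extension(w, formula) != w["T"] for w in no_worlds))

# Toy instance: phi(x) = P(x) separates the YES world from the NO world.
yes_worlds = [{"domain": [0, 1], "P": {0}, "T": {0}}]
no_worlds  = [{"domain": [0, 1], "P": {0, 1}, "T": {0}}]
phi = lambda w, x: x in w["P"]
print(ci_valid(phi, yes_worlds, no_worlds))
```

Note how the NO world supplies negative evidence: any hypothesis that happens to fit the YES worlds but coincides with a planted shortcut will exactly match some NO world's target and be rejected.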
Beyond raw accuracy, the authors stress the importance of formula “bloat”. They measure syntactic complexity via AST size (number of nodes) and quantifier depth, and report gold‑relative success rates (Acc@(gold + Δ)) as well as a “bloat rate” indicating how often a correct answer exceeds the size of the planted gold formula by a large margin. This dual metric discourages models from solving instances by generating long, case‑splitting formulas that merely exploit accidental regularities.
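The bloat metrics are straightforward to compute once formulas are parsed into ASTs. The sketch below uses one plausible node-counting convention (atoms count as one node; quantifiers contribute one level of depth); the paper's exact AST definition may differ.

```python
# Formulas as nested tuples, e.g. ("exists", "y", body) or ("atom", "P", "y").
def ast_size(node):
    """Count all nodes in the formula tree (one plausible convention)."""
    op = node[0]
    if op == "atom":
        return 1
    if op in ("forall", "exists"):
        return 1 + ast_size(node[2])
    if op == "not":
        return 1 + ast_size(node[1])
    return 1 + sum(ast_size(c) for c in node[1:])  # and / or

def q_depth(node):
    """Maximum nesting depth of quantifiers."""
    op = node[0]
    if op == "atom":
        return 0
    if op in ("forall", "exists"):
        return 1 + q_depth(node[2])
    if op == "not":
        return q_depth(node[1])
    return max(q_depth(c) for c in node[1:])

def acc_at_delta(correct, pred, gold, delta):
    """Gold-relative success: correct AND at most delta nodes larger than gold."""
    return correct and ast_size(pred) <= ast_size(gold) + delta

# gold: exists y. R(x, y) and P(y)   -> size 4, depth 1
gold = ("exists", "y", ("and", ("atom", "R", "x", "y"), ("atom", "P", "y")))
bloated = ("or", gold, ("and", ("atom", "P", "x"), ("not", ("atom", "P", "x"))))
print(ast_size(gold), q_depth(gold), ast_size(bloated))
```

Under this convention a correct but padded answer like `bloated` (size 9) fails `Acc@(gold + 2)` even though it is extensionally equivalent to `gold`, which is exactly the behavior the bloat-aware metric is meant to penalize.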
Data generation is carefully engineered to produce controlled difficulty. A curated pool of ~200 gold formulas, tagged by quantifier structure (single‑∃, single‑∀, nested, etc.) and sub‑family, serves as the source of concepts. For each gold formula, a “trap pool” of plausible shortcuts (atomic predicates, low‑depth formulas) and near‑miss mutants is built. In FullObs, worlds are generated incrementally; each new world must “kill” at least one surviving trap, ensuring that the set of worlds eliminates many spurious hypotheses. Rejection filters discard instances where any atomic or quantifier‑free formula already explains the data, guaranteeing that genuine quantifier reasoning is required.
CI generation extends this idea: YES worlds are added while tracking surviving traps, and the number of survivors is kept within a narrow band (typically 2–4 total, 1–2 near‑miss). NO worlds are then constructed to be exactly satisfied by at least one surviving trap, so any model that relies on a shortcut will inevitably fail on a NO world. This contrastive “trap” mechanism makes the negative evidence highly informative.
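The trap-killing generation loop described above can be sketched as follows. This is a simplified reconstruction under stated assumptions: worlds are sampled randomly, a trap is "killed" when it disagrees with the gold formula on a world's target extension, and a candidate world is accepted only if it kills at least one surviving trap. The sampling probabilities, world encoding, and retry cap are all illustrative.

```python
import random

def extension(w, formula):
    return {x for x in w["domain"] if formula(w, x)}

def kills(w, trap, gold):
    """A world kills a trap when trap and gold disagree on its extension."""
    return extension(w, trap) != extension(w, gold)

def random_world(rng, n=4):
    dom = list(range(n))
    return {"domain": dom,
            "P": {x for x in dom if rng.random() < 0.5},
            "R": {(x, y) for x in dom for y in dom if rng.random() < 0.3}}

def generate_worlds(gold, traps, k, rng, max_tries=10_000):
    """Incremental generation: each accepted world must kill a surviving trap."""
    worlds, survivors, tries = [], list(traps), 0
    while len(worlds) < k and survivors and tries < max_tries:
        tries += 1
        w = random_world(rng)
        killed = [t for t in survivors if kills(w, t, gold)]
        if killed:
            worlds.append(w)
            survivors = [t for t in survivors if t not in killed]
    return worlds, survivors

# gold: exists y. R(x, y) and P(y); traps are two plausible shortcuts.
gold = lambda w, x: any((x, y) in w["R"] and y in w["P"] for y in w["domain"])
traps = [lambda w, x: x in w["P"],
         lambda w, x: any((x, y) in w["R"] for y in w["domain"])]
ws, surviving = generate_worlds(gold, traps, 3, random.Random(0))
```

In the CI variant, the surviving traps are then reused in the opposite direction: NO worlds are built so that some surviving trap exactly matches their targets, which is what makes shortcut hypotheses fail.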
EC instances mask about 20 % of binary atoms (R and S) and, in the harder band, only R atoms. The unknown atoms are explicitly marked in the prompt. Validity is checked by encoding the unknowns as Boolean variables and asking a Z3 solver whether there exists an assignment that makes φ’s extension equal the target. This existential completion semantics captures the idea of “there exists a plausible world consistent with the observed facts”.
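The paper checks this with Z3, encoding each unknown atom as a Boolean variable. For a runnable illustration of the same existential-completion semantics, the sketch below simply enumerates all assignments of the unknown atoms (feasible here because the masked set is tiny); treating the base world's listed R-atoms as known-true and all unlisted, unmasked atoms as known-false is an assumption of this sketch.

```python
from itertools import product

def extension(w, formula):
    return {x for x in w["domain"] if formula(w, x)}

def ec_valid(base_world, unknown_atoms, formula, target):
    """Existential completion: some assignment of the unknown R-atoms
    makes the formula's extension equal the observed target."""
    for bits in product([False, True], repeat=len(unknown_atoms)):
        w = dict(base_world)
        w["R"] = set(base_world["R"]) | {a for a, b in zip(unknown_atoms, bits) if b}
        if extension(w, formula) == target:
            return True
    return False

# Known atom (0,1); masked atoms (1,2) and (2,0); phi(x) = exists y. R(x, y).
base = {"domain": [0, 1, 2], "R": {(0, 1)}}
unknowns = [(1, 2), (2, 0)]
phi = lambda w, x: any((x, y) in w["R"] for y in w["domain"])
print(ec_valid(base, unknowns, phi, {0, 1}))
```

A solver replaces the exponential enumeration with a single satisfiability query, which is what makes the regime scale to the paper's larger domains, but the acceptance condition is the same.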
The benchmark defines several difficulty bands. FullObs includes six bands ranging from simple (k = 4 worlds, quantifier depth = 1) to extreme (k = 10, lift‑hard formulas where a relation involving the free variable appears inside a universal quantifier). CI has a core band (no lift‑hard formulas, 7–8 YES worlds, 2–3 NO worlds) and a lift‑mix band (35 % lift‑hard formulas). EC has a core band (3 worlds, domain size 6–8, 20 % unknowns) and a hard band (larger domains, depth‑2 gold formulas, unknowns only on R).
Experiments with state‑of‑the‑art large language models and specialized reasoning systems reveal sharp performance gradients across bands. In FullObs, success rates drop dramatically as quantifier depth and domain size increase. In CI, models that rely on simple shortcuts succeed on YES worlds but are caught by the carefully crafted NO worlds; only those that learn to use negative evidence achieve high accuracy. In EC, some models can find completions that satisfy the formula, but performance degrades for deeper quantifier structures. Crucially, solutions whose AST size stays close to the gold formula’s size generalize far better to held‑out worlds than bloated solutions, confirming the utility of the bloat‑aware scoring.
The authors conclude that INDUCTION provides a rigorously defined, solver‑verifiable platform for evaluating logical induction, discriminative reasoning, and reasoning under uncertainty. By coupling accuracy with syntactic compactness, the benchmark encourages the development of models that truly abstract relational concepts rather than overfit by generating verbose case‑splitting expressions. This work lays groundwork for future research on symbolic reasoning, concept learning, and the use of large language models in scientific hypothesis generation.