HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
💡 Research Summary
The paper introduces HypoSpace, a diagnostic benchmark designed to evaluate large language models (LLMs) on their ability to generate sets of plausible scientific hypotheses when observations are underdetermined, i.e., when many mechanistically distinct explanations can account for the same data. Traditional AI-science benchmarks focus on single-answer correctness, which masks whether a model can explore the full space of admissible explanations. HypoSpace treats an LLM as a sampler over a finite hypothesis space H_O that is explicitly enumerated for each test instance, allowing precise measurement of three complementary metrics:
- Validity (VR) – the proportion of generated hypotheses that satisfy all given observations. This measures appropriateness or correctness of each sample.
- Uniqueness (NR) – the proportion of generated hypotheses that are novel relative to previously sampled ones, after applying a domain‑specific canonicalizer to collapse semantically equivalent forms. This captures originality and avoids redundancy.
- Recovery (RR) – the coverage of the enumerated admissible set, i.e., the fraction of distinct valid hypotheses discovered out of the total |H_O|. This reflects fluency, or the ability to explore the hypothesis space comprehensively.
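Taken together, the three indicators reduce to simple set arithmetic over a sample of proposals. A minimal sketch, not the authors' code, with `is_valid`, `canon`, and `admissible` as hypothetical stand-ins for the benchmark's deterministic validator, canonicalizer, and enumerated set H_O:

```python
def score_instance(samples, is_valid, canon, admissible):
    valid = [h for h in samples if is_valid(h)]          # observation-consistent proposals
    distinct = {canon(h) for h in valid}                 # collapse semantically equivalent forms
    admissible_canon = {canon(h) for h in admissible}
    vr = len(valid) / len(samples) if samples else 0.0   # Validity (VR)
    nr = len(distinct) / len(valid) if valid else 0.0    # Uniqueness (NR)
    rr = len(distinct & admissible_canon) / len(admissible_canon)  # Recovery (RR)
    return vr, nr, rr
```

Note that Uniqueness is computed relative to the valid proposals, so a model can score high VR while repeating itself (low NR) or while missing most of the admissible set (low RR).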
The benchmark comprises three structured domains where ground‑truth admissible sets can be exactly enumerated:
- Causal graph inference – given binary response vectors from single‑node interventions, models must propose all DAGs consistent with the observed descendant patterns.
- 3‑D voxel reconstruction under gravity – from a top‑down binary projection, models must generate all voxel stacks that satisfy the projection and a gravity‑based stacking rule.
- Boolean genetic interaction modeling – from phenotype observations of parental gene combinations, models must output Boolean expressions that reproduce the data; expressions are canonicalized to account for logical equivalences.
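For the Boolean domain, one natural way to implement such canonicalization is to map each expression to its truth table, so that logically equivalent forms collapse to a single representative. A hedged sketch under that assumption; the eval-based evaluation and default variable names are illustrative, not the paper's implementation:

```python
from itertools import product

def truth_table(expr, variables=("A", "B")):
    """Canonicalize a Boolean expression string to its truth table."""
    rows = []
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))               # bind A, B, ... for this row
        rows.append(bool(eval(expr, {"__builtins__": {}}, env)))
    return tuple(rows)                                   # the canonical form
```

Under this canonicalizer, `"A and B"` and `"B and A"` map to the same tuple and would count once toward Uniqueness and Recovery.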
Each domain includes controllable difficulty parameters (number of nodes, grid dimensions, Boolean operator depth) that scale the size of |H_O|. The authors evaluate a suite of recent instruction-tuned and reasoning-focused LLMs (e.g., GPT-4o, Gemini-2.5-Pro, Claude-Opus-4, GPT-5, DeepSeek-r1) using a fixed sampling budget N, typically set equal to |H_O| to enable full-coverage analysis. Results show a consistent pattern: Validity remains high (often above 80%), but Uniqueness and Recovery degrade sharply as the admissible set grows. In the hardest Boolean tasks, Recovery can fall below 20%, indicating severe mode collapse: models repeatedly generate a small subset of admissible hypotheses despite being able to produce valid ones.
To explain this phenomenon, the paper provides a simple probabilistic analysis. If the model’s induced distribution over H_O is peaked, with most probability mass concentrated on a small “head” of hypotheses, then the expected number of distinct hypotheses discovered after N draws grows roughly as K + N(1 − α), where K is the head size and α the head’s total probability mass. When α is close to 1, the slope 1 − α is tiny, so even a large N yields low coverage. This analysis predicts the observed “high-VR, low-RR” regime and shows that full recovery would require a sampling budget inversely proportional to the smallest tail probability, which can be astronomically large.
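The head/tail argument is easy to check with a toy simulation; the following sketch uses illustrative parameters, not the paper's experimental setup:

```python
import random

def expected_distinct(K, alpha, tail_size, N, trials=2000, seed=0):
    """Simulate a sampler with total mass alpha on a K-element head and the
    rest spread uniformly over a large tail; return the mean number of
    distinct hypotheses seen after N draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        for _ in range(N):
            if rng.random() < alpha:
                seen.add(rng.randrange(K))               # a head hypothesis (quickly exhausted)
            else:
                seen.add(K + rng.randrange(tail_size))   # a nearly-always-new tail hypothesis
        total += len(seen)
    return total / trials
```

With K = 3, alpha = 0.9, and N = 50, the simulation lands near the predicted K + N(1 − α) = 3 + 5 = 8 distinct hypotheses, far short of a large admissible set.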
The authors also experiment with stratified decoding, a simple technique that varies temperature or top-k across decoding rounds to force the sampler into less-probable regions. This modestly mitigates mode collapse, raising Recovery by 10–15% in several settings, and suggests that the issue is largely a sampling-distribution problem rather than a fundamental limitation of the underlying model.
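A minimal sketch of that idea over an explicit categorical distribution, with a softmax sampler standing in for the LLM decoder; the temperature schedule and sampling budget are illustrative assumptions:

```python
import math
import random

def sample_stratified(logits, temps=(0.5, 1.0, 2.0), per_temp=20, seed=0):
    """Sample indices under several temperatures; higher temperatures flatten
    the peaked head and surface tail hypotheses."""
    rng = random.Random(seed)
    seen = set()
    for t in temps:
        weights = [math.exp(l / t) for l in logits]      # softmax numerators at temperature t
        for _ in range(per_temp):
            seen.add(rng.choices(range(len(logits)), weights=weights)[0])
    return seen
```

Sampling only at low temperature would repeatedly return the highest-logit index; adding higher-temperature rounds recovers additional distinct hypotheses at some cost in per-sample validity.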
Importantly, HypoSpace is positioned as a diagnostic probe rather than a competitive leaderboard. By providing deterministic validators and exact ground‑truth hypothesis sets, it eliminates the need for human judges and enables reproducible, fine‑grained analysis of a model’s exploratory capabilities. The authors acknowledge that real scientific discovery involves richer, often non‑enumerable hypothesis spaces, so HypoSpace abstracts core elements (consistency checking, combinatorial hypothesis enumeration) to focus on measurement precision and cross‑model comparability.
In conclusion, the paper reveals that current state-of-the-art LLMs, despite strong reasoning performance on single-answer tasks, struggle to systematically explore underdetermined hypothesis spaces. This has implications for AI-augmented scientific discovery pipelines, where generating diverse, valid hypotheses is crucial. Suggested future work includes (1) training objectives that penalize peaked sampling distributions, (2) decoding strategies that explicitly promote diversity, and (3) meta-learning approaches that adapt sampling distributions based on observed coverage. HypoSpace offers a concrete, extensible benchmark for tracking progress on these fronts.