The Strategic Foresight of LLMs: Evidence from a Fully Prospective Venture Tournament
Can artificial intelligence outperform humans at strategic foresight – the capacity to form accurate judgments about uncertain, high-stakes outcomes before they unfold? We address this question through a fully prospective prediction tournament using live Kickstarter crowdfunding projects. Thirty U.S.-based technology ventures, launched after the training cutoffs of all models studied, were evaluated while fundraising remained in progress and outcomes were unknown. A diverse suite of frontier and open-weight large language models (LLMs) completed 870 pairwise comparisons, producing complete rankings of predicted fundraising success. We benchmarked these forecasts against 346 experienced managers recruited via Prolific and three MBA-trained investors working under monitored conditions. The results are striking: human evaluators achieved rank correlations with actual outcomes between 0.04 and 0.45, while several frontier LLMs exceeded 0.60, with the best (Gemini 2.5 Pro) reaching 0.74 – correctly ordering nearly four of every five venture pairs. These differences persist across multiple performance metrics and robustness checks. Neither wisdom-of-the-crowd ensembles nor human-AI hybrid teams outperformed the best standalone model.
💡 Research Summary
The paper investigates whether large language models (LLMs) can outperform human decision‑makers in strategic foresight – the ability to make accurate ex‑ante judgments about high‑stakes, uncertain outcomes. To avoid the common pitfall of “data leakage” (where models are evaluated on events that already exist in their training data), the authors designed a fully prospective prediction tournament using live Kickstarter technology projects that launched after the training cut‑offs of all models under study.
Thirty U.S.‑based technology ventures were selected while their fundraising campaigns were still active, meaning the true amount of money each would raise was unknown at the time of prediction. The authors generated 870 pairwise comparison tasks (each venture versus every other venture in both presentation orders) and asked participants to decide which of the two would raise more funds. This double round‑robin format yields a complete ranking while reducing noise inherent in direct ranking prompts.
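To make the arithmetic behind the 870 tasks concrete, here is a minimal Python sketch (not the authors' code) of how a double round‑robin over 30 ventures yields 30 × 29 = 870 ordered comparisons:

```python
from itertools import permutations

# 30 live Kickstarter projects (placeholder identifiers)
ventures = [f"venture_{i:02d}" for i in range(1, 31)]

# Double round-robin: every ordered pair (A, B) with A != B,
# so each unordered pair is judged in both presentation orders.
comparisons = list(permutations(ventures, 2))

assert len(comparisons) == 30 * 29  # 870 pairwise tasks
```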
Two human groups provided forecasts: (1) 346 experienced U.S. managers recruited via Prolific, who completed the same pairwise task; and (2) three MBA‑trained investors who produced full rankings under monitored, “no‑technology” conditions. On the AI side, a diverse set of frontier and open‑weight LLMs—including GPT‑5 variants, Claude 4.5, Gemini 2.5 (Pro and Basic), Grok 4, Gemma 3, Llama 3.1, and DeepSeek 3.2—were prompted to perform the same 870 comparisons. The models used consistent prompting, temperature settings, and chain‑of‑thought reasoning to generate a probability that venture A would outperform venture B; tournament scoring then produced a final ranking for each model.
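The summary does not spell out the exact tournament‑scoring rule. A minimal sketch, assuming the ranking comes from a simple tally of expected pairwise wins over each model's reported probabilities (an assumption, not the paper's documented method):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def rank_from_probabilities(judgments: List[Tuple[str, str, float]]) -> List[str]:
    """Turn pairwise judgments into a ranking.

    Each judgment is (venture_a, venture_b, p_a_beats_b): the model's
    probability that A raises more money than B. Scores accumulate the
    expected number of pairwise wins; because every pair appears in both
    presentation orders, order effects tend to average out.
    """
    scores: Dict[str, float] = defaultdict(float)
    for a, b, p_a_beats_b in judgments:
        scores[a] += p_a_beats_b
        scores[b] += 1.0 - p_a_beats_b
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example with three ventures:
example = [
    ("alpha", "beta", 0.8), ("beta", "alpha", 0.3),
    ("alpha", "gamma", 0.6), ("gamma", "alpha", 0.5),
    ("beta", "gamma", 0.4), ("gamma", "beta", 0.7),
]
print(rank_from_probabilities(example))  # ['alpha', 'gamma', 'beta']
```

Any scoring rule that is monotone in total expected wins would produce the same ordering for this toy input; the point is only that 870 probabilistic judgments collapse into one complete ranking per model.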
Performance was measured against the realized fundraising totals using Spearman rank correlation and pairwise‑accuracy (the proportion of correctly ordered venture pairs). Human evaluators achieved correlations ranging from 0.04 to 0.45, with an average around 0.28. In stark contrast, several frontier LLMs exceeded 0.60, and the best model, Gemini 2.5 Pro, reached a correlation of 0.74, correctly ordering roughly 79 % of the 870 pairs. Statistical tests (bootstrap confidence intervals, permutation tests) confirmed that the AI‑human gaps were highly significant (p < 0.001).
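Both metrics are standard. For reference, a short illustration (using SciPy and hypothetical toy data, not the study's ventures or code) of how Spearman rank correlation and pairwise accuracy against realized fundraising totals can be computed:

```python
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_accuracy(predicted_rank, realized_totals):
    """Share of venture pairs whose predicted order matches the realized outcome.

    predicted_rank: dict venture -> predicted rank (1 = best)
    realized_totals: dict venture -> dollars actually raised
    """
    correct = total = 0
    for a, b in combinations(realized_totals, 2):
        total += 1
        predicted_a_better = predicted_rank[a] < predicted_rank[b]
        realized_a_better = realized_totals[a] > realized_totals[b]
        if predicted_a_better == realized_a_better:
            correct += 1
    return correct / total

# Hypothetical toy data for three ventures:
pred = {"alpha": 1, "beta": 2, "gamma": 3}
raised = {"alpha": 120_000, "beta": 45_000, "gamma": 80_000}

rho, _ = spearmanr(
    [pred[v] for v in raised],      # predicted ranks (1 = best)
    [-raised[v] for v in raised],   # negate totals so a larger raise maps to a better rank
)
print(rho, pairwise_accuracy(pred, raised))  # 0.5, 0.667
```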
The authors also examined whether aggregating forecasts could improve results. "Wisdom‑of‑the‑crowd" ensembles of all human predictions and hybrid human‑AI teams (human rankings weighted by AI scores) failed to beat the top standalone LLM. Regression analyses linking model performance to established AI benchmarks (e.g., MMLU, BIG‑Bench) showed that larger parameter counts and stronger higher‑order reasoning abilities were positively associated with forecasting accuracy.
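The precise aggregation rules are not detailed in this summary. A minimal sketch, assuming the crowd ensemble averages human ranks and the hybrid team blends human and AI scores with a tunable weight (both are assumptions for illustration):

```python
import numpy as np

def crowd_rank(human_rankings: np.ndarray) -> np.ndarray:
    """Wisdom-of-the-crowd ensemble: average each venture's rank across humans.

    human_rankings: shape (n_humans, n_ventures); entry = rank that human gave
    the venture (1 = predicted top fundraiser). Returns a consensus ranking.
    """
    mean_ranks = human_rankings.mean(axis=0)
    # Re-rank the averaged positions to get an integer consensus ordering (1 = best).
    return mean_ranks.argsort().argsort() + 1

def hybrid_rank(human_scores: np.ndarray, ai_scores: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Hybrid team: blend human and AI venture scores, with weight w on the AI side."""
    blended = (1 - w) * human_scores + w * ai_scores
    return (-blended).argsort().argsort() + 1  # higher blended score = better rank
```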
Theoretical framing positions strategic foresight as a Brunswikian alignment problem: decision‑makers must match their internal models to the latent generative structure of the environment. Human bounded rationality—limited processing capacity, systematic biases, and noisy attention—has long been documented as a source of forecasting error. LLMs, by contrast, embody a form of “unbounded rationality,” leveraging massive textual corpora, extensive computational resources, and consistent inference mechanisms. The study therefore reframes a classic strategy question (who predicts better?) from a human‑centric variance issue to a competition between fundamentally different forecasting regimes.
Limitations are acknowledged. The sample size (30 ventures) and focus on a single crowdfunding platform may limit external validity; results might differ in other industries (e.g., biotech, heavy manufacturing) where success drivers are less publicly observable. Human participants operated under time and tool constraints that differ from real‑world investment settings where analysts can gather additional data and collaborate. Moreover, the exact prompts, temperature values, and any post‑processing of LLM outputs were not fully disclosed, which could affect reproducibility.
In conclusion, the paper provides the first robust, prospective evidence that state‑of‑the‑art LLMs can substantially outperform experienced managers and trained investors in a real‑world strategic prediction task. This finding has profound implications for the adoption of AI in strategic decision‑making: firms may consider deploying LLMs as independent forecasting engines rather than merely as decision‑support chatbots. Future research should expand the benchmark to diverse markets, explore longer‑term outcomes, and investigate optimal human‑AI collaboration protocols that combine the speed and consistency of LLMs with the contextual judgment and implementation capabilities of human strategists.