FrontierCS: Evolving Challenges for Evolving Intelligence
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, often NP-hard variants of competitive programming tasks, and research problems; both tracks use objective partial scoring. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for producing merely workable code instead of discovering high-quality algorithms and system designs.
💡 Research Summary
FrontierCS introduces an expert-curated, open-ended benchmark designed to evaluate large language models (LLMs) on genuine computer-science problems where the global optimum is unknown or computationally infeasible to find. The benchmark comprises 156 curated tasks split into two tracks: 107 algorithmic problems and 49 research-oriented problems. Algorithmic tasks are derived from competitive programming contests but are deliberately transformed into optimization, constructive, or interactive variants that admit many valid solutions and are scored on a graded scale rather than pass/fail. Research tasks span six domains (operating systems, high-performance computing, artificial intelligence, databases, programming languages, and security) and are taken from real research workflows; each is equipped with a deterministic evaluator that automatically verifies correctness and computes a quality metric (e.g., latency-accuracy trade-off, packing density, query count).
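As an illustration of graded scoring, consider a packing-style task: instead of a binary verdict, the evaluator can award partial credit proportional to how close a candidate's packing density comes to a strong human reference. The function below is a minimal sketch under assumed conventions; the function name, the normalization against a reference density, and the clipping rule are all illustrative, not FrontierCS's actual rubric.

```python
def score_packing(filled_cells: int, total_cells: int,
                  reference_density: float = 0.87) -> float:
    """Award partial credit for a packing solution.

    The score is the candidate's packing density normalized against a
    reference density (here: a hypothetical human-expert baseline),
    clipped to [0, 1] so matching or beating the reference earns full marks.
    """
    if total_cells <= 0:
        raise ValueError("grid must be non-empty")
    density = filled_cells / total_cells
    return min(density / reference_density, 1.0)
```

A scoring rule like this is strictly monotone in the objective, so any genuine improvement to the solution is reflected in the score, which is what gives the benchmark its discriminative power.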
Each problem follows a rigorous three-stage curation pipeline: (1) Proposal by experts (ICPC World-Finalist level) who submit original sources and describe intended modifications; (2) Implementation, where the problem is converted into an open-ended version, input/output formats are standardized, a partial-scoring verifier is built, and a human-authored reference solution is provided that clearly outperforms current LLM baselines; (3) Review by a second expert to confirm the absence of a known optimal solution, the discriminative power of the scoring function, and the correctness of the evaluator. This process is designed to ensure quality, diversity, and reproducibility.
The evaluation protocol requires models to generate a self-contained executable program given only the problem statement and any required API stubs. The program is run on a suite of generated instances under strict time and memory limits; the evaluator then returns a numeric score reflecting solution quality. This design shifts the focus from “does the model produce a correct answer?” to “how well does the model design and implement an effective algorithm or system?”
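A harness for this protocol can be sketched in a few lines: run the candidate program on one instance via stdin/stdout under a wall-clock limit, and treat timeouts or crashes as a zero score for that instance. The function name and the stdin/stdout convention are assumptions for illustration; real per-instance memory limits (e.g. via `resource.setrlimit` in a `preexec_fn`) are omitted for brevity.

```python
import subprocess
import sys
from typing import Optional

def run_candidate(program: str, instance: str,
                  time_limit_s: float = 10.0) -> Optional[str]:
    """Run a candidate program on one test instance under a time limit.

    `program` is a path to a self-contained Python script; the instance
    is fed on stdin and the candidate's answer is read from stdout.
    Returns None when the run times out or exits with an error, so the
    caller can assign that instance a score of zero.
    """
    try:
        result = subprocess.run(
            [sys.executable, program],
            input=instance,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return None  # exceeded the wall-clock limit
    if result.returncode != 0:
        return None  # crashed or raised an exception
    return result.stdout
```

The evaluator would then parse the returned stdout, check feasibility, and map the solution to a numeric quality score.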
Empirical results show a substantial gap between state-of-the-art LLMs (including GPT-5, Claude, Gemini) and human experts. On the algorithmic track, human experts achieve an average packing density of 87% on the Polyomino Packing task, whereas GPT-5 reaches only 47%. Across all algorithmic problems, humans outperform models by roughly 30% in absolute score, and on the research track the gap widens to over 40%. Scaling reasoning resources (longer context windows, more chain-of-thought steps) yields diminishing returns on the hardest instances, indicating that simply increasing compute does not resolve the underlying reasoning bottleneck. Moreover, models tend to over-optimize for “working code” that passes the verifier while neglecting deeper optimization of the objective metric, suggesting a lack of strategic exploration.
FrontierCS therefore serves both as a diagnostic benchmark and as a training platform. Because scores are computed automatically and are graded rather than pass/fail, they can be used as reinforcement-learning rewards or in self-play regimes, enabling future work to explore meta-learning, curriculum design, and prompt-engineering techniques aimed at improving open-ended algorithmic reasoning. The authors argue that closing the observed performance gap will require new model architectures or training objectives that better capture long-horizon planning, combinatorial search, and domain-specific knowledge integration. In sum, FrontierCS provides the first comprehensive, cross-domain, objectively scored benchmark that pushes LLMs toward true frontier reasoning in computer science, and it establishes a clear research agenda for the next generation of AI systems capable of genuine algorithm and system design.
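The idea of turning graded evaluator scores into training signal can be sketched very simply: sample several candidate programs for one problem, score each with the evaluator, and baseline the scores against their mean to form policy-gradient-style advantages. This is a generic illustration of reward shaping from graded scores; the function name and the mean baseline are assumptions, not a recipe prescribed by FrontierCS.

```python
from statistics import mean

def score_advantages(scores: list[float]) -> list[float]:
    """Convert raw evaluator scores for k sampled programs into
    mean-baselined advantages usable as reinforcement-learning rewards.

    Programs that beat the group average get positive advantage and are
    reinforced; below-average programs get negative advantage. This only
    works because the scores are graded: with pass/fail scoring on hard
    open-ended tasks, every sample would often receive the same signal.
    """
    baseline = mean(scores)
    return [s - baseline for s in scores]
```

For example, evaluator scores of `[1.0, 0.5, 0.0]` across three samples yield advantages `[0.5, 0.0, -0.5]`.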