Blog Posts

Research notes, benchmark releases, and agent evaluation insights.

Your Next Long-Context Recipe: Open-Ended Problems

May 12, 2026 · 14 min read

Coding data lifts reasoning, and agentic coding is now the dominant recipe. We introduce Frontier-CS: 172 open-ended problems with continuous scoring, all one harbor run away. Kimi K2.6 and Claude Opus 4-7 go head-to-head, sustaining up to 456 turns, 405 tool calls, and 531K output tokens per problem.

Read more →

LLMs Defeated in Open-Ended Problems

Feb 26, 2026 · 6 min read

Modern LLMs claim superhuman algorithmic ability, but what happens when there is no strict verifier? We analyze how multi-turn 'optimization' in Frontier-CS exposes the cognitive ceiling and catastrophic failure modes of LLMs in open-ended problem solving.

Read more →

Evaluating the Hardest CS Problems in the Age of LLMs

Feb 10, 2026 · 13 min read

Frontier-CS scores solutions on a continuous scale across heterogeneous hardware. This post explains the evaluation architecture behind the leaderboard: hash-based resume, resource-grouped clusters, pinned environments, and the challenges ahead for agentic submissions.

Read more →

Frontier-CS 1.0 Release

Feb 3, 2026 · 5 min read

We are releasing Frontier-CS 1.0, a major update to our open-ended Computer Science benchmark. This release expands Frontier-CS to 240 tasks across both the algorithmic and research tracks. We also introduce a new Elo-based leaderboard, along with full execution traces of model solutions to enable deeper analysis and reproducibility.

Read more →