Frontier-CS Blog Posts

Jun 16, 2026 · 11 min read

Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era

Agents can spend test-time compute by trying, observing, and revising. We derive an Elo reference for repeated sampling, then show that in a 2022 two-week coding marathon, current agents plateau within 24 hours while top humans keep improving.

Jun 11, 2026 · 8 min read

Roadmap to FrontierCS 2.0

FrontierCS 2.0 extends open-ended evaluation from static sandboxed tasks to Harbor-based feedback loops and repo-level environments.

May 15, 2026 · 8 min read

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

We release FrontierSmith, a system that converts closed-ended coding problems into open-ended optimization tasks for training long-horizon coding agents.

May 12, 2026 · 8 min read

Your Next Long-Context Recipe: Open-Ended Problems

We integrate FrontierCS into Harbor and release a preview long-horizon agent leaderboard on 178 open-ended algorithmic tasks. Kimi K2.6 and Claude Code Opus 4.7 show similar headline capability, but very different failure modes.

Apr 21, 2026 · 10 min read

Frontier-CS Goes Live: 2,000 Humans vs. AI on an Open-Ended Problem

We placed an open-ended optimization problem in CALICO, UC Berkeley's official programming contest. To score, you had to beat a frontier AI. Out of 2,000+ contestants, only one submission surpassed the strongest AI agent.

Mar 17, 2026 · 7 min read

Auto Building Agent Memory for Open Algorithm Problems: Evaluating ALMA on Frontier CS

Frontier-CS is an open benchmark for evaluating evolving agents. We show how researchers built ALMA to automatically learn agent memory for continual learning, and how it performs on Frontier-CS.

Mar 10, 2026 · 12 min read

Evaluating Evolving Agent Systems at Scale with Frontier-CS

Evolving agent systems are advancing fast, but evaluation hasn't kept up. We show how Frontier-CS enables comprehensive, large-scale benchmarking of evolving agents, moving beyond small case studies to comparison at scale.

Feb 26, 2026 · 6 min read

LLM Defeated in Open-ended Problems

Modern LLMs claim superhuman algorithmic abilities, but what happens when there is no strict verifier? We analyze how multi-turn 'optimization' in Frontier-CS exposes the cognitive ceiling and catastrophic failures of AI in open-ended problem solving.

Feb 10, 2026 · 13 min read

Evaluating the Hardest CS Problems in the Age of LLMs

Frontier-CS scores solutions on a continuous scale across heterogeneous hardware. This post explains the evaluation architecture behind the leaderboard: hash-based resume, resource-grouped clusters, pinned environments, and the challenges ahead for agentic submissions.

Feb 3, 2026 · 5 min read

Frontier-CS 1.0 Release

We are releasing Frontier-CS 1.0, a major update to our open-ended Computer Science benchmark. This release expands Frontier-CS to 240 tasks across both the algorithmic and research tracks. We also introduce a new Elo-based leaderboard, along with full execution traces of model solutions to enable deeper analysis and reproducibility.