Blog Posts

Research notes, benchmark releases, and agent evaluation insights.

Your Next Long-Context Recipe: Open-Ended Problems

May 12, 2026 · 14 min read

Coding data lifts reasoning, and agentic coding is now the dominant recipe. We introduce Frontier-CS: 172 open-ended problems with continuous scoring, all one harbor run away. Kimi K2.6 and Claude Opus 4-7 go head-to-head, sustaining up to 456 turns, 405 tool calls, and 531K output tokens per problem.

Read more →

LLMs Defeated in Open-Ended Problems

Feb 26, 2026 · 6 min read

Modern LLMs claim superhuman algorithmic ability, but what happens when there is no strict verifier? We analyze how multi-turn 'optimization' in Frontier-CS exposes the cognitive ceiling and catastrophic failure modes of LLMs in open-ended problem solving.

Read more →

Evaluating the Hardest CS Problems in the Age of LLMs

Feb 10, 2026 · 13 min read

Frontier-CS scores solutions on a continuous scale across heterogeneous hardware. This post explains the evaluation architecture behind the leaderboard: hash-based resume, resource-grouped clusters, pinned environments, and the challenges ahead for agentic submissions.

Read more →

Frontier-CS 1.0 Release

Feb 3, 2026 · 5 min read

We are releasing Frontier-CS 1.0, a major update to our open-ended Computer Science benchmark. This release expands Frontier-CS to 240 tasks across both the algorithmic and research tracks. We also introduce a new Elo-based leaderboard, along with full execution traces of model solutions to enable deeper analysis and reproducibility.

Read more →