Frontier-CS Goes Live: 2,000 Humans vs. AI on an Open-Ended Problem

We placed an open-ended optimization problem in Calico, UC Berkeley's official programming contest. To score, you had to beat a frontier AI. Out of 2,000+ contestants, only one submission surpassed the strongest AI agent.


What is Calico?

Calico (Computer Algorithms and Logic in Competitive Operations) is the official competitive programming contest hosted by UC Berkeley’s Computer Science Mentors (CSM) at the end of each semester. It is co-organized with ICPC@Berkeley and runs on DOMjudge, the same open-source judge system used at the ICPC World Finals. Problems span graph theory, dynamic programming, combinatorics, and computational geometry, each authored and stress-tested by ICPC World Finalists and national olympiad medalists.

A typical contest sees nearly 100 in-person participants on Berkeley’s campus alongside over 2,000 online competitors from around the world, rivaling many national-level contests. Participants include top 0.1% competitive programmers on Codeforces, ICPC World Finalists, and IOI medalists.

In-person contestants competing during Calico at UC Berkeley.
The Calico team with tourist (Gennady Korotkevich), the world's #1 ranked competitive programmer.

The Problem

Frontier-CS is an open-ended, continuously scored benchmark for AI on hard CS problems. This spring, Frontier-CS contributed one problem to Calico. To score, you had to beat an AI.

Given $M$ noisy observations of off-diagonal entries in an unknown $N \times N$ multiplication table, choose positive integers $a_1, a_2, \ldots, a_N$ and a discard set $S$ with at most $D$ elements to solve:

\[\min_{a_1, \ldots, a_N,\; S} \sum_{k \notin S} w_k \cdot \frac{|a_{r_k} \cdot a_{c_k} - v_k|}{v_k}\]

where each observation $k$ specifies a row index $r_k$, a column index $c_k$, a target value $v_k$, and an importance weight $w_k$.

The problem is NP-hard in general. In the largest test cases, $N = 4 \times 10^3$ and $M = 2 \times 10^6$, with a time limit of 10 seconds.
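
For concreteness, the objective is cheap to evaluate for any candidate answer. Below is a minimal penalty evaluator, a sketch only: the judge's exact I/O format is not reproduced here, so observations are plain `(r, c, v, w)` tuples and the discard set holds observation indices.

```python
def penalty(a, observations, discard):
    """Weighted relative error of assignment `a` over non-discarded observations.

    a            -- 1-indexed list of positive integers a_1..a_N (a[0] unused)
    observations -- list of (r_k, c_k, v_k, w_k) tuples
    discard      -- set of observation indices k to skip (|discard| <= D)
    """
    total = 0.0
    for k, (r, c, v, w) in enumerate(observations):
        if k in discard:
            continue
        total += w * abs(a[r] * a[c] - v) / v
    return total

# Toy instance with N = 3 and exact (noise-free) observations of a = [2, 3, 4]:
obs = [(1, 2, 6, 1.0), (1, 3, 8, 1.0), (2, 3, 12, 1.0)]
print(penalty([None, 2, 3, 4], obs, set()))  # → 0.0, the table is reconstructed exactly
```

Any deviation from the true table is charged in weighted relative error, and discarding an observation removes its term entirely, which is what makes the choice of $S$ part of the optimization.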

Left: the reconstructed multiplication table for $\mathbf{a} = [2, 3, 4, 4]$. Right: the equivalent graph formulation. The problem reduces to assigning node values that minimize weighted relative error across all edges.

Beat the AI

Rather than judging submissions against a fixed optimal answer, the problem uses a “beat the AI” scoring model. Three AI baselines define three difficulty levels. To clear a level, your penalty must be lower than the AI’s on every test case.
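
In code, the clearing condition is a simple per-test-case comparison (assumed semantics based on the description above: strictly lower penalty on every case):

```python
def clears_level(human_penalties, ai_penalties):
    # A submission clears a level only if its penalty is strictly lower
    # than that AI baseline's penalty on every test case.
    return all(h < a for h, a in zip(human_penalties, ai_penalties))

print(clears_level([1.2, 0.8], [1.5, 0.9]))  # True: better on both cases
print(clears_level([1.2, 1.0], [1.5, 0.9]))  # False: worse on the second case
```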

How hard is that? The AI baselines are not toy solutions. They get progressively more sophisticated:

The three AI baselines, in order of increasing strength.

Level 1: Grok 4, 121 lines. A greedy spanning-tree heuristic. Grok 4 picks one node as an anchor and propagates values outward, one edge at a time. Fast, but fragile: one bad anchor poisons the entire solution. It even has a floating-point bug that silently disables its confidence metric.
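
The greedy idea can be sketched as follows (an illustrative reconstruction, not Grok 4's actual 121 lines): anchor one node, guess its value, then walk the observation graph and infer each remaining node from a single edge. The anchor guess via a square root of one incident value is an assumption for this sketch.

```python
import math
from collections import defaultdict, deque

def greedy_propagate(n, observations, anchor=1, anchor_value=None):
    """Anchor-and-propagate heuristic: each node is set from one edge only."""
    adj = defaultdict(list)
    for r, c, v, w in observations:
        adj[r].append((c, v))
        adj[c].append((r, v))
    a = [0] * (n + 1)
    if anchor_value is None:
        # Assumed anchor guess: sqrt of the anchor's first incident value.
        anchor_value = max(1, round(math.sqrt(adj[anchor][0][1])))
    a[anchor] = anchor_value
    q = deque([anchor])
    while q:
        u = q.popleft()
        for vtx, val in adj[u]:
            if a[vtx] == 0:               # assign each node exactly once
                a[vtx] = max(1, round(val / a[u]))  # from v ≈ a_u * a_vtx
                q.append(vtx)
    return a

obs = [(1, 2, 6, 1.0), (1, 3, 8, 1.0), (2, 3, 12, 1.0)]
print(greedy_propagate(3, obs))  # → [0, 2, 3, 4] (index 0 unused)
```

The fragility is visible in the structure: every value downstream of a bad anchor or a noisy edge inherits that error, since each node is fixed from a single observation and never revisited.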

Level 2: GPT-5.4-thinking-high, 377 lines. A weighted-median coordinate descent solver with multi-start initialization. Unlike Grok 4, it uses all observations simultaneously and restarts from three different seeds. After convergence, the worst outliers are discarded and the solver re-runs.
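
The core update is a one-dimensional weighted-median step. Fixing all other values, each term $w_k\,|a_i a_j - v_k|/v_k$ equals $(w_k a_j / v_k)\,|a_i - v_k/a_j|$, so the best $a_i$ is a weighted median of the ratios $v_k/a_j$ over its incident edges. A hedged sketch of that inner loop (multi-start seeding and the outlier-discard re-runs of the actual baseline are omitted):

```python
from collections import defaultdict

def weighted_median(pairs):
    """Return the value at which cumulative weight first reaches half the total."""
    pairs.sort()
    half = sum(w for _, w in pairs) / 2
    acc = 0.0
    for val, w in pairs:
        acc += w
        if acc >= half:
            return val
    return pairs[-1][0]

def coordinate_descent(n, observations, a, sweeps=5):
    """Sweep nodes in order, setting each to its weighted-median optimum."""
    incident = defaultdict(list)
    for r, c, v, w in observations:
        incident[r].append((c, v, w))
        incident[c].append((r, v, w))
    for _ in range(sweeps):
        for i in range(1, n + 1):
            if not incident[i]:
                continue
            # Candidates v/a_j with weights w * a_j / v, per the identity above.
            cands = [(v / a[j], w * a[j] / v) for j, v, w in incident[i]]
            a[i] = max(1, round(weighted_median(cands)))
    return a

obs = [(1, 2, 6, 1.0), (1, 3, 8, 1.0), (2, 3, 12, 1.0)]
print(coordinate_descent(3, obs, [0, 2, 2, 2]))  # converges to a local optimum
```

Unlike the greedy propagation, every node is refit against all of its observations on every sweep, which is what makes the method robust to individual noisy edges, though it can still stall in a local optimum, hence the baseline's three restarts.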

Level 3: GPT-5.4-thinking-high + agentic harness, 533 lines. A six-stage pipeline produced by an open-ended solver agent co-designed by Bo Peng and Qiuyang Mang. The agent iteratively generates, executes, evaluates, and refines solutions over multiple rounds. The final pipeline chains log-domain linearization, robust relaxation, multi-start seeding, real-domain median sweep, integer hill-climbing, and iterative outlier rejection. Each stage hands a better-structured problem to the next.
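
The first stage, log-domain linearization, can be sketched like this (an illustration under assumed details, not the agent's actual code): taking logs turns $a_r \cdot a_c \approx v$ into the linear system $x_r + x_c \approx \log v$ with $x_i = \log a_i$, which a damped Jacobi iteration solves in weighted least squares; exponentiating and rounding then yields integer seeds for the later stages.

```python
import math

def log_linearize(n, observations, iters=60):
    """Weighted least squares on x_r + x_c ≈ log v via damped Jacobi iteration."""
    x = [0.0] * (n + 1)  # x_i = log a_i, index 0 unused
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        wsum = [1e-12] * (n + 1)   # tiny epsilon guards isolated nodes
        for r, c, v, w in observations:
            resid = x[r] + x[c] - math.log(v)
            grad[r] += w * resid
            grad[c] += w * resid
            wsum[r] += w
            wsum[c] += w
        for i in range(1, n + 1):
            x[i] -= 0.5 * grad[i] / wsum[i]  # damping 0.5 keeps Jacobi stable
    return [None] + [max(1, round(math.exp(xi))) for xi in x[1:]]

obs = [(1, 2, 6, 1.0), (1, 3, 8, 1.0), (2, 3, 12, 1.0)]
print(log_linearize(3, obs))  # → [None, 2, 3, 4]
```

The point of the pipeline structure is visible even in this one stage: the log-domain solve is convex and cheap, so it hands the expensive integer hill-climbing a starting point that is already close.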

Takeaway: Stronger models produce longer code (121 → 377 → 533 lines) and deeper pipelines. But the leap to Level 3 is not about a smarter model. It is about long-horizon agentic coding: iterating over many rounds of generation, execution, and refinement to assemble a solution no single-shot model can match.

Results

The contest lasted 3 hours with 12 problems in total, so contestants could devote only a fraction of their time to this problem. We also kept the judge open for one week after the contest to collect additional submissions. The results below include both in-contest and post-contest attempts.

Out of 285 total submissions, only one surpassed the strongest AI agent (Level 3).

Results: 21/285 beat Level 1, 7/285 beat Level 2, 1/285 beat Level 3

The human solutions that cleared Level 2 are strikingly different from the AI baselines. Two examples: one team used a square-root heuristic that simply estimates each $a_i \approx \sqrt{v_k}$ and averages, ignoring the graph structure entirely. Another team ran 7 passes of coordinate descent seeded from medians, essentially a simpler version of GPT-5.4’s approach without multi-start or interleaved outlier rejection.
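
The square-root heuristic fits in a few lines. This is an assumed reconstruction of the idea as described, not the team's actual submission: estimate each $a_i$ as the average of $\sqrt{v_k}$ over its incident observations, with no graph reasoning at all.

```python
import math
from collections import defaultdict

def sqrt_heuristic(n, observations):
    """Set a_i to the mean of sqrt(v) over its incident observations."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, c, v, w in observations:
        root = math.sqrt(v)
        for node in (r, c):
            sums[node] += root
            counts[node] += 1
    return [None] + [max(1, round(sums[i] / counts[i])) if counts[i] else 1
                     for i in range(1, n + 1)]

obs = [(1, 2, 6, 1.0), (1, 3, 8, 1.0), (2, 3, 12, 1.0)]
print(sqrt_heuristic(3, obs))  # → [None, 3, 3, 3]
```

It is crude, every node near the table's "diagonal scale" gets a similar value, yet when values are clustered the relative-error objective forgives this, which is exactly the kind of structural shortcut the AI pipelines never tried.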

Takeaway: Human submissions average only 125 lines, with no deep pipelines or stacked tricks. Beating an AI baseline does not require a more complex solution, just a different one. Where AI builds elaborate multi-stage architectures, humans find short, direct paths that exploit structural weaknesses the AI overlooked.

Conclusion

The agentic harness beat nearly every human, but at a cost: hundreds of tokens and a 533-line pipeline. Humans averaged just 125 lines. Can we learn from how humans iteratively refine solutions to open-ended problems, and use those trajectories to teach agents better long-horizon coding strategies?

We are open-sourcing everything from this blog:
- Problem, test cases, AI solutions, and evaluation script
- All human submissions from Calico Spring '26

We look forward to more collaborations between Calico and Frontier-CS. Stay tuned, and join us on Discord.