We placed an open-ended optimization problem in Calico, UC Berkeley's official programming contest. To score, you had to beat a frontier AI. Of the 2,000+ contestants, only one submission surpassed the strongest AI agent.
Calico (Computer Algorithms and Logic in Competitive Operations) is the official competitive programming contest hosted by UC Berkeley’s Computer Science Mentors (CSM) at the end of each semester. It is co-organized with ICPC@Berkeley and runs on DOMjudge, the same open-source judge system used at the ICPC World Finals. Problems span graph theory, dynamic programming, combinatorics, and computational geometry, each authored and stress-tested by ICPC World Finalists and national olympiad medalists.
A typical contest sees nearly 100 in-person participants on Berkeley’s campus alongside over 2,000 online competitors from around the world, rivaling many national-level contests. Participants include top 0.1% competitive programmers on Codeforces, ICPC World Finalists, and IOI medalists.
Given $M$ noisy observations of off-diagonal entries in an unknown $N \times N$ multiplication table, choose positive integers $a_1, a_2, \ldots, a_N$ and a discard set $S$ with at most $D$ elements to solve:
\[\min_{a_1, \ldots, a_N,\; S} \sum_{k \notin S} w_k \cdot \frac{|a_{r_k} \cdot a_{c_k} - v_k|}{v_k}\]

where each observation $k$ specifies a row index $r_k$, a column index $c_k$, a target value $v_k$, and an importance weight $w_k$.
The problem is NP-hard in general. In the largest test cases, $N = 4 \times 10^3$ and $M = 2 \times 10^6$, with a time limit of 10 seconds.
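One useful observation: once the assignment $a_1, \ldots, a_N$ is fixed, the optimal discard set $S$ is simply the $D$ observations with the largest penalty terms, since every term is nonnegative. Here is a minimal sketch of the objective, assuming observations are given as `(r, c, v, w)` tuples with 0-indexed nodes (our representation, not the contest's actual input format):

```python
def penalty(a, obs, D):
    """Contest objective for a fixed assignment `a` (list of positive ints).

    obs: list of (r, c, v, w) tuples -- row, column, target value, weight.
    D:   maximum number of observations that may be discarded.
    For fixed `a`, the best discard set is exactly the D largest terms.
    """
    errs = [w * abs(a[r] * a[c] - v) / v for (r, c, v, w) in obs]
    errs.sort(reverse=True)
    return sum(errs[D:])  # drop the D worst terms, keep the rest
```

Any solver, human or AI, is ultimately minimizing this quantity; the hard part is the joint search over $N$ integers.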
Rather than judging submissions against a fixed optimal answer, the problem uses a “beat the AI” scoring model. Three AI baselines define three difficulty levels. To clear a level, your penalty must be lower than the AI’s on every test case.
How hard is that? The AI baselines are not toy solutions. They get progressively more sophisticated:
Level 1: Grok 4, 121 lines. A greedy spanning-tree heuristic. Grok 4 picks one node as an anchor and propagates values outward, one edge at a time. Fast, but fragile: one bad anchor poisons the entire solution. It even has a floating-point bug that silently disables its confidence metric.
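The core idea can be sketched in a few lines. This is our own minimal reconstruction, not Grok 4's actual 121-line submission: pick an anchor, guess its value, and propagate $a_c \approx v / a_r$ outward along observed edges in BFS order.

```python
from collections import defaultdict, deque

def greedy_propagate(n, obs, anchor=0, anchor_val=1):
    """Spanning-tree style propagation: fix one node, derive the rest.

    obs: list of (r, c, v, w) tuples. A wrong anchor_val skews every
    downstream value -- the fragility described above.
    """
    adj = defaultdict(list)
    for r, c, v, w in obs:
        adj[r].append((c, v))
        adj[c].append((r, v))
    a = [None] * n
    a[anchor] = anchor_val
    q = deque([anchor])
    while q:
        u = q.popleft()
        for nb, v in adj[u]:
            if a[nb] is None:                    # first edge reaching nb wins
                a[nb] = max(1, round(v / a[u]))  # propagate one edge at a time
                q.append(nb)
    return [x if x is not None else 1 for x in a]
```

Because each node's value is derived from a single path back to the anchor, one noisy edge early in the tree corrupts an entire subtree.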
Level 2: GPT-5.4-thinking-high, 377 lines. A weighted-median coordinate descent solver with multi-start initialization. Unlike Grok 4, it uses all observations simultaneously and restarts from three different seeds. After convergence, the worst outliers are discarded and the solver re-runs.
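The key trick is that with every other value held fixed, the best $a_i$ is a weighted median: each term $w_k \, |a_i a_j - v_k| / v_k$ rewrites as $(w_k a_j / v_k) \, |a_i - v_k / a_j|$. A minimal sketch of one such solver, under the same hypothetical `(r, c, v, w)` format (again a reconstruction, not GPT-5.4's code):

```python
def weighted_median(vals, wts):
    """Smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(vals, wts))
    half = sum(wts) / 2
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]

def coordinate_descent(n, obs, a, iters=10):
    """Repeatedly set each a[i] to its 1-D optimum with the rest fixed."""
    by_node = [[] for _ in range(n)]
    for r, c, v, w in obs:
        by_node[r].append((c, v, w))
        by_node[c].append((r, v, w))
    for _ in range(iters):
        for i in range(n):
            if not by_node[i]:
                continue
            # w*|a_i*a_j - v|/v == (w*a_j/v) * |a_i - v/a_j|
            targets = [v / a[j] for j, v, w in by_node[i]]
            wts = [w * a[j] / v for j, v, w in by_node[i]]
            a[i] = max(1, round(weighted_median(targets, wts)))
    return a
```

Coordinate descent only finds a local optimum, which is exactly why the Level 2 baseline restarts from three different seeds and interleaves outlier rejection.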
Level 3: GPT-5.4-thinking-high + agentic harness, 533 lines. A six-stage pipeline produced by an open-ended solver agent co-designed by Bo Peng and Qiuyang Mang. The agent iteratively generates, executes, evaluates, and refines solutions over multiple rounds. The final pipeline chains log-domain linearization, robust relaxation, multi-start seeding, real-domain median sweep, integer hill-climbing, and iterative outlier rejection. Each stage hands a better-structured problem to the next.
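The full pipeline is too long to reproduce, but one of its later stages is easy to illustrate. Here is a sketch of just the integer hill-climbing step under our assumed `(r, c, v, w)` observation format: try $\pm 1$ moves on each $a_i$, keeping any move that lowers the penalty (with the $D$ worst observations implicitly discarded).

```python
def hill_climb(a, obs, D, rounds=5):
    """First-improvement +/-1 hill climbing on an integer assignment."""
    def penalty(a):
        errs = sorted((w * abs(a[r] * a[c] - v) / v for r, c, v, w in obs),
                      reverse=True)
        return sum(errs[D:])  # best discard set: drop the D worst terms

    best = penalty(a)
    for _ in range(rounds):
        improved = False
        for i in range(len(a)):
            for delta in (-1, 1):
                if a[i] + delta < 1:
                    continue
                a[i] += delta
                p = penalty(a)
                if p < best - 1e-12:
                    best, improved = p, True   # keep the move
                else:
                    a[i] -= delta              # revert
        if not improved:
            break
    return a, best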
The contest lasted 3 hours with 12 problems in total, so contestants could only devote a fraction of their time to this problem. We also kept the judge open for one week after the contest to collect additional submissions. The results below include both in-contest and post-contest attempts.
Out of 285 total submissions, only one surpassed the strongest AI agent (Level 3).

The human solutions that cleared Level 2 are strikingly different from the AI baselines. Two examples: one team used a square-root heuristic that estimates $a_i \approx \sqrt{v_k}$ from each observation touching node $i$ and averages the estimates, ignoring the graph structure entirely. Another team ran 7 passes of coordinate descent seeded from medians, essentially a simpler version of GPT-5.4’s approach without multi-start or interleaved outlier rejection.
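The square-root heuristic fits in a handful of lines. A minimal sketch under our assumed `(r, c, v, w)` format (our reconstruction of the idea, not the team's actual code):

```python
from math import sqrt

def sqrt_heuristic(n, obs):
    """Average sqrt(v) over all observations touching each node.

    Since a_i * a_j = v, sqrt(v) is the geometric mean of a_i and a_j --
    a decent estimate of both when the values are of similar magnitude.
    """
    sums = [0.0] * n
    cnts = [0] * n
    for r, c, v, w in obs:
        for i in (r, c):
            sums[i] += sqrt(v)
            cnts[i] += 1
    return [max(1, round(sums[i] / cnts[i])) if cnts[i] else 1
            for i in range(n)]
```

It ignores the graph structure entirely, yet is robust to outliers in a way the spanning-tree baseline is not: no single bad observation can poison more than one average.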
The agentic harness beat nearly every human, but at a cost: hundreds of tokens and a 533-line pipeline. Humans averaged just 125 lines. Can we learn from how humans iteratively refine solutions to open-ended problems, and use those trajectories to teach agents better long-horizon coding strategies?
We look forward to more collaborations between Calico and Frontier-CS. Stay tuned, and join us on Discord.