Second Batch Benchmark

This document (PDF) describes our plan for a second batch of problems, which will be created, tested, and graded from March to June 2026. This batch will be designed as a formal benchmark. It will also include a separate round of informal community experimentation, in which a set of problems is made available to the interested public, with solutions provided after a few days, followed by an open discussion.

Timeline

Problem selection

Mathematicians across fields submit unpublished problems with proofs of at most 8 pages. All submissions undergo a first round of refereeing, and 10 problems are selected for the benchmark plus a separate set for community experimentation.

Benchmark testing

The editorial board tests AI systems via API. Each system is an open-source harness that calls publicly available models and gets one attempt per question, with no additional interaction.
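As a rough illustration of the one-attempt rule, the testing loop could look something like the sketch below. This is not the editorial board's actual tooling: the `System` type, the `run_one_shot` function, and the idea of representing a harness as a plain callable are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical type (an assumption for this sketch): a system is any callable
# that takes a problem statement and returns a solution text.
System = Callable[[str], str]


def run_one_shot(systems: Dict[str, System], problems: List[str]) -> Dict[str, List[str]]:
    """Query every system exactly once per problem, with no follow-up interaction."""
    results: Dict[str, List[str]] = {}
    for name, solve in systems.items():
        # One call per problem: no retries and no clarifying exchange.
        results[name] = [solve(problem) for problem in problems]
    return results
```

In this sketch each harness is free to call its underlying models however it likes inside `solve`; the only constraint modeled is that the caller submits each problem once and keeps whatever comes back.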

Benchmark grading

Human mathematicians referee AI solutions blind, rating each as essentially flawless, publishable with minor revisions, requiring major revisions, or rejected. Results — including referee reports and editorial reasoning — will be published online.

AI Systems Tested

A: IMProofBench ProofCouncil

Source code

Team: Johannes Schmitt, Tim Gehrunger, Jasper Dekoninck, Gergely Bérczi, Uri Kreitner, Liam Price

Base models: gpt-5.5 pro (primary); gpt-5.5, gemini-3.1-pro-preview, claude-opus-4-7

B: UCLA Moonshot Harness

Source code

Team: Amit Sahai, Terence Tao, Raghu Meka, Kai-Wei Chang, Nanyun (Violet) Peng, Wei Wang, Junya Zhang

Base model: gpt-5.5 pro

C: OpenAI ChatGPT 5.5 Pro

Prompt

Team: Sébastien Bubeck, Mehtaab Sawhney

Base model: gpt-5.5 pro

D: Princeton Momus

Source code

Team: Sanjeev Arora, Liam Fowl

Base model: gemini-3.1-pro-preview