Second Batch Benchmark
This document (PDF) describes our plan for a second batch of problems, which will be created, tested, and graded from March to June 2026. This batch will be designed as a formal benchmark. It will also include a separate round of informal community experimentation, in which a set of problems is made available to the interested public, with solutions provided after a few days, followed by an open discussion.
Timeline
Problem selection
Mathematicians across fields submit unpublished problems with proofs of at most 8 pages. All submissions undergo a first round of refereeing, after which 10 problems are selected for the benchmark, along with a separate set for community experimentation.
Benchmark testing
The editorial board tests AI systems via API. Each system is an open-source harness that calls publicly available models. It gets one attempt per question, with no additional interaction.
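As a rough illustration of the one-attempt rule, the following Python sketch queries a system exactly once per problem and stores the raw answer. It is not the editorial board's harness: the names `Submission`, `run_one_shot`, and `ask_model` are hypothetical, and `ask_model` stands in for whatever API call a real harness would make.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Submission:
    """One answer from one system to one problem (hypothetical record type)."""
    problem_id: str
    system_name: str
    solution_text: str


def run_one_shot(
    system_name: str,
    ask_model: Callable[[str], str],
    problems: Dict[str, str],
) -> List[Submission]:
    """Query the system exactly once per problem and record the raw answer.

    `ask_model` is an assumed callable that wraps the harness's API call:
    it takes a problem statement and returns the proposed solution as text.
    """
    submissions = []
    for problem_id, statement in problems.items():
        # Single call per problem: no retries, no follow-up prompts,
        # and no feedback passed back to the system.
        answer = ask_model(statement)
        submissions.append(Submission(problem_id, system_name, answer))
    return submissions


def placeholder_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return f"Proof attempt for: {prompt}"


if __name__ == "__main__":
    demo_problems = {"P1": "Show that ...", "P2": "Prove that ..."}
    for sub in run_one_shot("demo-harness", placeholder_model, demo_problems):
        print(sub.problem_id, sub.system_name, len(sub.solution_text))
```

Keeping each answer as an immutable record makes it straightforward to strip the system name before the answers are passed to referees.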
Benchmark grading
Human mathematicians referee AI solutions blind, rating each as essentially flawless, publishable with minor revisions, requiring major revisions, or rejected. Results — including referee reports and editorial reasoning — will be published online.