Second Batch Benchmark

This document (PDF) describes our plan for a second batch of problems, which will be created, tested, and graded from March to June 2026. This batch will be designed as a formal benchmark. It will also include a separate round of informal community experimentation, in which a set of problems is made available to the interested public, with solutions provided after a few days, followed by an open discussion.

Timeline

Problem selection

March to May 2026

Mathematicians across fields submit unpublished problems with proofs of at most 8 pages. All submissions undergo a first round of refereeing, and 10 problems are selected for the benchmark plus a separate set for community experimentation.

Benchmark testing

Late May to early June 2026

The editorial board tests AI systems via API. Each system is an open-source harness that calls publicly available models. It gets a single attempt per question, with no additional interaction.
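The one-shot protocol can be illustrated with a minimal sketch. This is not the editorial board's actual tooling: the `Harness` type, the `run_one_shot` function, and the stub system below are hypothetical names introduced only for illustration. The sketch assumes a harness can be modeled as a function from a problem statement to a single proof attempt.

```python
from typing import Callable, Dict, List

# Hypothetical type: a harness maps a problem statement to one proof attempt.
Harness = Callable[[str], str]


def run_one_shot(systems: Dict[str, Harness], problems: List[str]) -> Dict[str, List[str]]:
    """Call every system exactly once per problem, with no follow-up interaction."""
    results: Dict[str, List[str]] = {}
    for name, harness in systems.items():
        # One call per problem; the output is recorded as-is and never retried.
        results[name] = [harness(problem) for problem in problems]
    return results


if __name__ == "__main__":
    # Stub harness standing in for a real system (assumption for the example).
    def echo_harness(problem: str) -> str:
        return "attempt for: " + problem

    print(run_one_shot({"A": echo_harness}, ["Problem 1", "Problem 2"]))
```

The point of the sketch is that the loop has no retry or feedback path: whatever the harness returns on its only call is what goes to the referees.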

Benchmark grading

June 2026

Human mathematicians referee AI solutions blind, rating each as essentially flawless, publishable with minor revisions, requiring major revisions, or rejected. Results, including referee reports and editorial reasoning, will be published online.
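As a rough sketch of how the four-level rating scale could be recorded and summarized, consider the following. The `Rating` enum and `tally` function are hypothetical names, and the example ratings are placeholders, not benchmark results.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class Rating(Enum):
    """The four referee ratings named in the grading description."""
    ESSENTIALLY_FLAWLESS = "essentially flawless"
    MINOR_REVISIONS = "publishable with minor revisions"
    MAJOR_REVISIONS = "requiring major revisions"
    REJECTED = "rejected"


def tally(ratings: Iterable[Rating]) -> Dict[Rating, int]:
    """Count how many solutions received each rating, including zero counts."""
    counts = Counter(ratings)
    return {rating: counts.get(rating, 0) for rating in Rating}


if __name__ == "__main__":
    # Placeholder ratings for illustration only.
    example = [Rating.REJECTED, Rating.MINOR_REVISIONS, Rating.REJECTED]
    for rating, count in tally(example).items():
        print(f"{rating.value}: {count}")
```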

AI Systems Tested

A: IMProofBench `ProofCouncil`

Source code

Team: Johannes Schmitt, Tim Gehrunger, Jasper Dekoninck, Gergely Bérczi, Uri Kreitner, Liam Price

Base models: gpt-5.5 pro (primary); gpt-5.5, gemini-3.1-pro-preview, claude-opus-4-7

B: UCLA Moonshot Harness

Source code

Team: Junyi Zhang, Xinjie He, Hyunsik Chae, Ethan Ji, Eric Jiang, Rushil Raghavan, Yiwen Kou, Alex Taylor, Kai-Wei Chang, Raghu Meka, Violet Peng, Amit Sahai, Terence Tao, Wei Wang

Base model: gpt-5.5 pro

C: OpenAI ChatGPT 5.5 Pro

Prompt

Team: Sébastien Bubeck, Mehtaab Sawhney

Base model: gpt-5.5 pro

D: Princeton Momus

Source code

Team: Sanjeev Arora, Liam Fowl

Base model: gemini-3.1-pro-preview

Results

Second Batch Benchmark Report

Supplementary Documents

A: IMProofBench `ProofCouncil`