A set of ten math questions to evaluate the capabilities of AI systems to autonomously solve problems that arise naturally in the research process.
Frequently Asked Questions
What constitutes a solution?
We consider an AI model to have answered one of our questions if it can autonomously produce a proof that conforms to the levels of rigor and scholarship prevailing in the mathematics literature. In particular, the AI should not rely on human input for any mathematical idea or content, or to help it isolate the core of the problem. Citations should include precise statement numbers and should be either to articles published in peer-reviewed journals or to arXiv preprints.
Will you be assessing the correctness of solutions generated by AI systems?
No. As we write in the preprint, "our question list should not be considered a benchmark in its current form." We will shortly begin a new round with an assessment procedure in place, in which we will both verify that solutions are autonomously produced and remove data contamination issues. If you are interested in having your AI system participate by submitting solutions to problems in subsequent rounds, please contact us at contact@1stproof.org.
At the same time, we believe that the experimentation taking place during this round is valuable. We encourage the community to discuss the autonomy and quality of the solutions that have been produced, based on the criteria specified in the first question above ("What constitutes a solution?"). Such public discussions would ideally take place after February 13, in order to minimize the risk of data contamination for other community members who are experimenting with our questions.
I have used a model to answer a question. What should I do?
Feel free to advertise that you have answered it on social media using the hashtag #1stProof.
You write that "the best publicly available AI systems struggle to answer many of our questions." Does that mean that they can answer some of them?
These questions illustrate the boundary of what state-of-the-art models are able to do. As such, the list also includes questions that some models have been able to answer successfully.
Hasn't question 1 already been answered?
A short note with a very rough sketch of a proof was posted on Hairer's website some years ago. Since we are asking for a level of scholarship equivalent to that of a mathematics research paper, a successful answer would involve filling in the gaps in the argument.