Publications

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Published in NeurIPS 2025 Datasets and Benchmarks Track, 2025

We introduce CLEVER, a high-quality, manually curated benchmark of 161 problems for end-to-end verified code generation in Lean.

Recommended citation: Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. CLEVER: A Curated Benchmark for Formally Verified Code Generation. In NeurIPS 2025 Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=IbOacMF5qd.

PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition

Published in NeurIPS 2024 Datasets and Benchmarks Track, 2024

We present PutnamBench, a new multi-language benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems.

Recommended citation: George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition. In NeurIPS 2024 Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=ChKCF75Ocd.

PutnamBench: A Multilingual Competition-Mathematics Benchmark for Formal Theorem-Proving

Published in AI for Math Workshop @ ICML 2024, 2024

We present PutnamBench, a new multilingual evaluation benchmark for formal theorem-proving.

Recommended citation: George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: A Multilingual Competition-Mathematics Benchmark for Formal Theorem-Proving. In AI for Math Workshop @ ICML 2024, 2024. URL https://openreview.net/forum?id=vqW1VRFeVP.