Can AI Outperform Humans in Scientific Research?

OpenAI’s PaperBench measures how well AI can replicate published scientific studies. In a test of 20 machine learning papers, human PhD students achieved a replication score of 41.4%, nearly twice that of the best AI model (Claude 3.5 Sonnet) at 21%, while GPT-4o lagged far behind at 4.1%.

The study highlights a key difference: AI generates code quickly but struggles to refine it, whereas humans take longer to get started but improve steadily. To advance research in this area, OpenAI has released PaperBench as an open-source tool on GitHub.