Meta Accused of Faking Llama 4 Benchmark

Meta allegedly inflated the benchmark results for its Llama 4 models. In a blog post announcing the release, Meta claimed an Elo score of 1417 on LMArena, placing Llama 4 Maverick above GPT-4o and just below Google’s Gemini 2.5 Pro. However, testers noticed that the version submitted to the leaderboard, an experimental variant of Llama 4 Maverick optimized for conversationality, differed from the publicly available model. In response, LMArena has updated its leaderboard policies to ensure fairer and more reproducible results.