Meta Denies Boosting Llama 4 Benchmark Scores Amid AI Model Performance Concerns

Meta’s VP of generative AI responds to benchmark score controversy surrounding Llama 4 models, denying claims of test set training.
Matilda
As someone deeply immersed in the AI ecosystem, I've been closely tracking the developments around Meta's newly released Llama 4 models, Maverick and Scout. Over the weekend, a rumor spread like wildfire across X and Reddit, suggesting that Meta had manipulated benchmark scores by training the models on test sets. If true, that claim could shake trust not only in Meta's AI credibility but also in how benchmarks are perceived across the industry.

Ahmad Al-Dahle, Meta's VP of Generative AI, addressed the claims head-on in a post on X. He categorically denied the rumor, saying it was "simply not true" that Meta trained Llama 4 Maverick and Scout on benchmark test sets. For context, test sets are critical tools for evaluating a model's performance after training; using them during training can produce artificially inflated scores that don't reflect real-world performance.

This rumor originated from an anonymous post on a Chinese social platform, where a se…
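To see why training on a test set inflates scores, consider a minimal toy sketch (not Meta's setup — just an illustration): a "model" that simply memorizes its training examples looks perfect when evaluated on the data it was trained on, yet performs at chance on genuinely held-out data.

```python
import random

random.seed(0)

# Toy dataset with purely random labels, so memorization cannot generalize.
data = [(i, random.randint(0, 1)) for i in range(200)]
train, test = data[:150], data[150:]

# "Model" that memorizes every training example -- extreme contamination.
memory = {x: y for x, y in train}

def accuracy(model_memory, examples, default=0):
    """Fraction of examples the memorizer labels correctly."""
    correct = sum(1 for x, y in examples if model_memory.get(x, default) == y)
    return correct / len(examples)

# Evaluating on data the model has already seen reports a perfect score...
print(accuracy(memory, train))  # 1.0

# ...while the held-out test set reveals near-chance performance.
print(accuracy(memory, test))
```

The same principle scales up: if benchmark test questions leak into a model's training data, the reported score measures recall of those specific questions, not general capability.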