Meta Faces Backlash for Benchmark Gaming with Experimental Llama 4 Maverick Model
Meta is under fire for using a tweaked Llama 4 model to top LM Arena's rankings. Here's what really happened—and what it means for AI benchmarks.
Matilda
Meta's Benchmark Controversy: A Closer Look at the Llama 4 Maverick Incident

Earlier this week, I came across something that raised a lot of eyebrows in the AI community: Meta used an unreleased, experimental version of its Llama 4 Maverick model to achieve a top ranking on LM Arena, a popular crowdsourced benchmark. High scores aren't uncommon in the AI race, but the way Meta achieved this result sparked an online storm, and for good reason. Let's break down what happened, why it matters, and what this means for developers and the AI ecosystem at large.

Image credits: Rafael Henrique/SOPA Images/LightRocket / Getty Images

Meta's Experimental Tweak: What Really Happened

Meta submitted a version of its Llama 4 model named "Llama-4-Maverick-03-26-Experimental" to LM Arena, a platform where human raters compare outputs from different AI models and rank them based on performance. This version was not the same as the vanilla open-source release; it was optimized specifically for conversational ou…