Meta’s Maverick AI Benchmark Results Raise Eyebrows: Why the Public Version Doesn’t Match Arena Scores

Meta’s Maverick AI model ranked high on LM Arena, but discrepancies in its public version spark debate.
Matilda
Meta recently dropped its flagship Llama 4 models, including the widely discussed "Maverick," which is already stirring controversy. While it ranked second on LM Arena, a benchmark in which human raters compare AI outputs, the version tested isn't the same as the one available to the public. And that's a problem.

The version of Maverick that Meta used on LM Arena is an "experimental chat version," something Meta quietly admitted in its announcement. Dig deeper, and the Llama website clarifies that the model tested was "optimized for conversationality." That alone makes the benchmark less useful for anyone trying to evaluate the model's true capabilities.

We've known for a while that LM Arena isn't the most robust benchmark around, but companies generally don't fine-tune or customize their entries specifically to game the ranking, or at least they haven't owned up to it. Until now.

Developers Are Downloading a Different AI Than They Were Sold

It turns out tha…