Did xAI Exaggerate Grok 3's AI Benchmark Results?

xAI's claims about Grok 3's superiority are under scrutiny as OpenAI accuses the company of misleading benchmark reporting.
Matilda
The world of artificial intelligence is a competitive one, with companies vying for the title of "smartest AI." Recently, Elon Musk's xAI threw its hat in the ring, claiming its latest model, Grok 3, outperforms OpenAI's leading models. But has xAI played fair in showcasing Grok 3's capabilities?

xAI's blog post proudly displayed a graph illustrating Grok 3's performance on AIME 2025, a challenging math benchmark. At first glance, it appeared that Grok 3 had indeed surpassed OpenAI's o3-mini-high. However, OpenAI employees quickly cried foul, pointing out a crucial omission in xAI's analysis: the "cons@64" scoring method.

What is cons@64? Imagine giving a student 64 attempts to solve a math problem and then selecting their most frequent answer. That's essentially what cons@64 does for AI models. It allows the model to "try" a problem multiple times and takes the most common answer as the final solution. This naturally leads …
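To make the scoring idea concrete, here is a minimal Python sketch of consensus (majority-vote) scoring next to single-attempt scoring. It is an illustration only, not xAI's or OpenAI's evaluation code. The function names (`consensus_answer`, `cons_at_k_score`, `first_attempt_score`), the data layout, and the tie-breaking rule (the answer seen first wins) are all assumptions made for this example, and the sample uses 4 attempts per problem instead of 64 to stay readable.

```python
from collections import Counter


def consensus_answer(attempts):
    """Return the most frequent answer among a model's sampled attempts.

    Assumption: ties are broken in favor of the answer seen first.
    """
    return Counter(attempts).most_common(1)[0][0]


def cons_at_k_score(attempts_by_problem, correct_by_problem):
    """Fraction of problems whose majority-vote answer is correct."""
    hits = sum(
        consensus_answer(attempts) == correct_by_problem[problem]
        for problem, attempts in attempts_by_problem.items()
    )
    return hits / len(attempts_by_problem)


def first_attempt_score(attempts_by_problem, correct_by_problem):
    """Fraction of problems answered correctly on the first attempt only."""
    hits = sum(
        attempts[0] == correct_by_problem[problem]
        for problem, attempts in attempts_by_problem.items()
    )
    return hits / len(attempts_by_problem)


# Hypothetical data: two problems, four sampled answers each.
attempts_by_problem = {
    "p1": ["204", "204", "17", "204"],
    "p2": ["5", "9", "9", "9"],
}
correct_by_problem = {"p1": "204", "p2": "9"}

cons_score = cons_at_k_score(attempts_by_problem, correct_by_problem)
single_score = first_attempt_score(attempts_by_problem, correct_by_problem)
print(f"consensus: {cons_score:.2f}, first attempt: {single_score:.2f}")
```

In this made-up sample the model gets the second problem wrong on its first try but right on most retries, so the consensus score (1.00) comes out higher than the first-attempt score (0.50). That gap is why two scores are only comparable when both were computed the same way.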