Why Crowdsourced AI Benchmarks Are Flawed and What It Means for the Industry
Discover why experts are raising concerns about crowdsourced AI benchmarks, their ethical flaws, and how they impact model evaluations.
Matilda
Why Are Crowdsourced AI Benchmarks Under Scrutiny?

If you've been following advancements in artificial intelligence, you may have heard of crowdsourced AI benchmarks like Chatbot Arena. These platforms let users evaluate AI models by comparing outputs and voting for the response they prefer. While this seems like a democratic way to assess AI capabilities, experts warn that these benchmarks have serious flaws. According to linguistics professor Emily Bender, co-author of The AI Con, a benchmark must measure a specific construct and demonstrate clear validity in doing so, a criterion many crowdsourced platforms fail to meet. For instance, does voting for one response over another truly reflect user preference, let alone model quality? Critics argue that these platforms risk being co-opted by AI labs to promote exaggerated claims, making them unreliable indicators of real progress.

Image Credits: Carol Yepes / Getty Images
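To make the mechanism under scrutiny concrete, here is a minimal sketch of how pairwise votes might be aggregated into a leaderboard. It uses a standard Elo update and hypothetical model names and votes, not any platform's actual method; the critics' point is that the resulting numbers only encode which response voters happened to prefer, not a validated measure of model quality.

```python
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    """Apply a standard Elo update after one user vote preferring `winner` over `loser`."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser] -= k * (1 - expected_win)

# Hypothetical vote log: each entry is (preferred_model, other_model).
votes = [("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")]

ratings = defaultdict(lambda: 1000.0)  # every model starts from the same baseline rating
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# The "leaderboard" is just models sorted by accumulated preference wins.
print(sorted(ratings.items(), key=lambda item: item[1], reverse=True))
```

Nothing in this aggregation step checks what the votes actually measure, which is exactly the construct-validity gap Bender and other critics highlight.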