AI Benchmarked with Super Mario Bros.: Claude 3.7 Leads, GPT-4o Struggles

AI tested with Super Mario Bros.: Claude 3.7 wins, GPT-4o struggles in real-time play.
Matilda
Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

Hao AI Lab, a research organization at the University of California, San Diego, on Friday put AI models through live Super Mario Bros. games. Anthropic's Claude 3.7 performed the best, followed by Claude 3.5. Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled.

To be clear, it wasn't quite the same version of Super Mario Bros. as the original 1985 release. The game ran in an emulator and integrated with GamingAgent, a framework Hao developed in-house, to give the AIs control over Mario. GamingAgent fed each AI basic instructions, such as "If an obstacle or enemy is near, move/jump left to dodge," along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.

Still, Hao says that the game forced each model to "learn" to plan complex maneuvers and develop gameplay strategies.

Interestingly, the lab found that reasoning models like OpenAI…
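The article doesn't publish GamingAgent's source, but the loop it describes (a short instruction prompt plus a screenshot goes in, a snippet of Python button presses comes out) can be sketched roughly as below. Every name here (query_model, Emulator, press) is a hypothetical stand-in, not Hao AI Lab's actual API:

```python
"""Minimal sketch of a GamingAgent-style control loop.

Inferred from the article's description only; this is not Hao AI Lab's
actual implementation.
"""
import base64
import time

SYSTEM_PROMPT = (
    "You control Mario. If an obstacle or enemy is near, move/jump left "
    "to dodge. Respond only with Python code that calls "
    "press(button, duration) to send inputs."
)


def query_model(prompt: str, screenshot_png: bytes) -> str:
    """Hypothetical model call: send the instructions plus a base64-encoded
    frame to a vision-capable model, get back a short Python snippet.
    A real agent would swap in an actual API client here."""
    _ = base64.b64encode(screenshot_png).decode()
    return "press('right', 0.5)\npress('A', 0.2)"  # placeholder model output


class Emulator:
    """Hypothetical emulator wrapper exposing a frame grab and button input."""

    def screenshot(self) -> bytes:
        return b""  # would return the current frame as PNG bytes

    def press(self, button: str, duration: float) -> None:
        print(f"pressing {button} for {duration}s")


def run_agent(steps: int = 10) -> None:
    emu = Emulator()
    for _ in range(steps):
        frame = emu.screenshot()
        code = query_model(SYSTEM_PROMPT, frame)
        # The model's reply is Python; execute it with the emulator's
        # press() exposed. A real framework would sandbox this step.
        exec(code, {"press": emu.press})
        time.sleep(0.1)  # real-time play: the game does not pause for the model


if __name__ == "__main__":
    run_agent(steps=3)
```

The sleep in the loop hints at why real-time play is hard for slower models: the game keeps running while the model deliberates, so any latency between screenshot and button press costs Mario dearly.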