ARC-AGI-2: New AI Intelligence Test Stumps Leading Models, Exposing Limitations

The Arc Prize Foundation's ARC-AGI-2 test challenges AI models like GPT-4.5 and Claude 3.7, revealing major gaps in general intelligence.
Matilda
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models. “Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares and generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before.

The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questio…