Anthropic’s AI Hiring Test Keeps Getting Outsmarted by Claude
Anthropic, the AI safety-focused company behind the Claude large language models, is facing a uniquely ironic problem: its own AI keeps passing its technical hiring exams. Since 2024, the company’s performance optimization team has used a take-home coding challenge to vet job applicants—but as Claude has grown smarter, so too has the risk of AI-assisted cheating. Now, even top human candidates can’t consistently outperform the model, prompting Anthropic to overhaul its assessment strategy entirely.
This isn’t just a quirky tech anecdote. It’s a real-world case study in how generative AI is reshaping professional evaluation, and in why traditional hiring methods may no longer work in an era when AI coding assistants can outperform many humans on well-specified tasks.
The Original Test Worked—Until Claude Got Too Good
When Anthropic first rolled out its take-home assignment, it was designed to measure deep systems knowledge: optimizing low-level code for speed and memory efficiency on constrained hardware. Applicants had a fixed time window to submit their best solution, and the results reliably separated strong engineers from the rest.
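The article doesn’t reproduce the assignment itself, but the flavor of work it describes—to be clear, what follows is a hypothetical illustration, not Anthropic’s actual test—looks roughly like the sketch below: the same computation written with a cache-hostile and a cache-friendly access pattern, the kind of difference a strong systems candidate is expected to spot and exploit under time pressure.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1024

/* Column-order traversal of a row-major array: each access jumps
 * N * sizeof(double) bytes, so most loads miss the cache. */
double sum_column_order(const double *a) {
    double total = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            total += a[i * N + j];
    return total;
}

/* Identical arithmetic, but iterating in the array's natural row-major
 * order: consecutive accesses stay within the same cache line. */
double sum_row_order(const double *a) {
    double total = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            total += a[i * N + j];
    return total;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t k = 0; k < (size_t)N * N; k++)
        a[k] = (double)(k % 7);

    /* Both functions return the same sum; only the memory access
     * pattern, and therefore the runtime, differs. */
    printf("column order: %f\n", sum_column_order(a));
    printf("row order:    %f\n", sum_row_order(a));

    free(a);
    return 0;
}
```

On typical hardware the row-order version runs several times faster than the column-order one, purely because of memory locality; spotting and justifying that kind of trade-off is what a classical optimization take-home is built to measure.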
But everything changed with Claude Opus 4. According to Tristan Hume, lead of Anthropic’s performance optimization team, that version of Claude began routinely matching or exceeding the performance of most human applicants under the same time constraints. “That still allowed us to distinguish the strongest candidates,” Hume noted in a recent blog post—but only briefly.
Then came Claude Opus 4.5. Suddenly, even the very best human submissions were indistinguishable from the AI’s output. “Under the constraints of the take-home test, we no longer had a way to distinguish between the output of our top candidates and our most capable model,” Hume wrote.
The implications are stark: if you can’t tell whether a submission came from a brilliant engineer or a well-prompted AI, your hiring process loses its validity.
Why This Isn’t Just About Cheating
At first glance, this seems like a classic academic integrity issue—like students using ChatGPT to write essays. But in Anthropic’s case, the stakes are higher. These aren’t undergraduates submitting term papers; they’re senior engineers applying for roles that demand nuanced judgment, creative problem-solving, and deep systems intuition.
The real danger isn’t just that someone might use Claude to pass the test—it’s that the test itself no longer measures what it’s supposed to. If an AI can replicate the expected output without understanding the underlying trade-offs, then the assessment fails its core purpose: evaluating human expertise.
And because Anthropic allows remote, unsupervised submissions (a common practice in tech hiring), there’s no practical way to enforce a “no-AI” rule. Even if applicants swear they didn’t use assistance, the results speak for themselves: the line between human and AI output has blurred beyond recognition.
A New Approach: Make the Problem AI Can’t Solve (Yet)
Faced with this dilemma, Hume and his team didn’t double down on surveillance or proctoring. Instead, they redesigned the test around novelty and ambiguity—two areas where current AI still stumbles.
The new challenge shifts away from textbook-style optimization puzzles. Instead, it presents candidates with an open-ended, poorly specified problem involving emerging hardware constraints and unconventional performance metrics. Success now depends less on writing efficient code and more on asking the right questions, making informed assumptions, and iterating based on incomplete information.
“In other words,” Hume explains, “we made the test feel more like real engineering work—and less like a puzzle an AI could memorize or pattern-match.”
Early results suggest the new format works. Human candidates with strong systems experience thrive in the ambiguity, while even Claude Opus 4.5 produces generic or misaligned responses when clear parameters are missing.
The Bigger Picture: AI Is Rewriting the Rules of Expertise
Anthropic’s struggle reflects a broader shift across the tech industry. As AI tools become ubiquitous in software development—from GitHub Copilot to in-IDE code generators—the definition of “coding skill” is evolving. Writing correct, efficient code is no longer the sole benchmark; prompt engineering, system design, and debugging AI-generated output are becoming equally important.
But that raises a new question: if AI can do the coding, what should we be testing for in technical interviews?
Some companies are moving toward live pair-programming sessions, architecture whiteboarding, or portfolio reviews. Others, like Anthropic, are betting on problems that require contextual reasoning and iterative refinement—skills that remain stubbornly human.
For job seekers, the message is clear: don’t just learn to code. Learn to think like an engineer in a world where code writes itself.
Transparency as a Strategy
In a surprising move, Hume published the original test publicly alongside his blog post, not to shame past applicants but to invite better ideas for assessment design. “We’re curious if anyone can design a take-home challenge that’s both meaningful for humans and resistant to current AI,” he wrote.
This openness aligns with Anthropic’s stated commitment to responsible AI development. Rather than treating the problem as a proprietary security issue, the company is inviting the community to help solve it—a rare display of humility from a firm building some of the world’s most advanced AI models.
It also underscores a deeper truth: no one has figured out how to fairly assess human ability in the age of superhuman AI. If even Anthropic can’t keep its own model from acing its hiring test, the rest of us shouldn’t feel bad about rethinking our evaluation methods.
What Comes Next for Technical Hiring?
Anthropic’s experience suggests that the future of technical hiring won’t rely on static coding challenges. Instead, assessments will likely emphasize:
- Problem scoping: Can the candidate clarify vague requirements?
- Trade-off analysis: Do they understand the real-world constraints beyond runtime complexity?
- Iterative refinement: Can they adapt their approach based on feedback or new data?
- Communication: Can they explain their decisions clearly to teammates?
These are all skills that AI can assist with, but not yet replicate end-to-end. And crucially, they mirror the day-to-day reality of engineering on high-performing teams.
As AI continues to advance, the gap between “what AI can do” and “what humans bring to the table” will keep shifting. Companies that adapt their hiring practices accordingly will find the best talent. Those that cling to outdated benchmarks risk filtering out exactly the kind of thinkers they need most.
The Irony No One Saw Coming
There’s a poetic irony in Anthropic—an AI safety pioneer—being outmaneuvered by its own creation. But rather than resisting the trend, the company is leaning into it, using the challenge as a catalyst for innovation in talent assessment.
For the rest of the tech world watching closely, the lesson is clear: if your hiring test can be solved by an AI, it’s probably not testing what matters anymore. The future belongs to those who can navigate ambiguity, ask better questions, and collaborate effectively—with both humans and machines.
And if you’re applying to Anthropic anytime soon? Don’t bother trying to cheat with Claude. Chances are the team has already run its own model against the new test, and knows exactly where it falls short.