Claude Opus 4.5 fails coding tests despite “best model” claims

According to ZDNet, Anthropic’s new Claude Opus 4.5 model claims to be “the best model in the world for coding, agents, and computer use,” but independent testing found it failed two of four standard coding challenges. It failed a WordPress plugin test, where download functionality didn’t work and the generated code was non-functional, and it failed a JavaScript currency validation test by rejecting valid inputs and crashing on edge cases. While it passed tests involving PHP framework debugging and AppleScript automation, the 50% failure rate directly contradicts Anthropic’s marketing claims of coding superiority.

The gap between claims and reality

Here’s the thing about AI coding assistants: they’re supposed to make developers’ lives easier, not create more debugging work. When a model can’t handle basic tasks like generating downloadable files or validating simple currency inputs, that’s a problem. And when that same model is being marketed as the absolute best in the world? That’s a credibility gap that could hurt developers who rely on these tools for real work.

What’s particularly interesting is that the author found better results with Claude’s cheaper Sonnet model. That’s the opposite of what you’d expect. Usually, you pay more for better performance, right? But with Opus 4.5, you’re apparently paying premium prices for what amounts to unreliable output. For developers working on industrial applications where reliability is non-negotiable, this kind of inconsistency is simply unacceptable.

When the basics break down

The file handling issues alone are telling. Opus 4.5 couldn’t provide working download links, mixed documentation into code files, and required multiple attempts just to get usable code. This isn’t advanced AI magic – this is basic functionality that any coding assistant should handle smoothly. If a model can’t reliably deliver code in a usable format, what good is its coding ability?

And the WordPress plugin test was particularly brutal. After jumping through hoops to actually get the code, the plugin presented an interface but didn’t actually function. The randomize button did nothing. The clear button did nothing. Basically, you got the visual shell of an application without any of the functionality. That’s like getting a car with no engine – it might look right, but it’s not going anywhere.
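To be clear about how low that bar is, wiring up buttons like that is beginner-level front-end work. Here’s a minimal sketch of what working randomize and clear handlers look like (in TypeScript; the element IDs and the list being randomized are hypothetical, since the article doesn’t publish the plugin’s actual code):

```typescript
// Minimal sketch of the button wiring the generated plugin was missing.
// Element IDs ("randomize-btn", "clear-btn", "output") and the item list
// are hypothetical; the article does not show the actual generated code.
const items = ["alpha", "beta", "gamma", "delta"];
const output = document.getElementById("output") as HTMLElement;

document.getElementById("randomize-btn")?.addEventListener("click", () => {
  // Pick a random item and render it -- the behavior the test expected.
  const pick = items[Math.floor(Math.random() * items.length)];
  output.textContent = pick;
});

document.getElementById("clear-btn")?.addEventListener("click", () => {
  // Reset the output -- the "clear" behavior that did nothing in the test.
  output.textContent = "";
});
```

That handful of lines is roughly the entire gap between a visual shell and a functioning plugin.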

Why this matters for real development

For enterprise development teams considering AI coding tools, reliability isn’t a nice-to-have – it’s essential. When you’re building business-critical applications or working with industrial systems, you can’t afford code that crashes on null values or rejects valid inputs. The JavaScript currency validation failure is exactly the kind of edge case that breaks production applications.

Think about it: if you’re processing financial transactions or working with industrial control systems, you need code that handles every possible input gracefully. When Opus 4.5’s solution crashed on empty values instead of returning sensible errors, that’s a red flag for any serious development work. Companies relying on industrial computing solutions need tools they can trust, not AI assistants that introduce new bugs.
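For concreteness, here’s a minimal sketch of what graceful handling looks like (in TypeScript; the exact validation rules are assumptions, since the article doesn’t reproduce the test’s full spec). Instead of throwing on null or empty input, the function returns a structured error:

```typescript
// Minimal sketch of defensive currency validation. The specific rules
// (up to two decimal places, optional "$" and thousands separators) are
// assumptions; the article does not reproduce the test's full specification.
type ValidationResult =
  | { ok: true; value: number }
  | { ok: false; error: string };

function validateCurrency(input: string | null | undefined): ValidationResult {
  // Guard the edge cases that reportedly crashed Opus 4.5's solution:
  // null, undefined, and empty or whitespace-only strings.
  if (input == null || input.trim() === "") {
    return { ok: false, error: "Amount is required." };
  }

  // Accept forms like "1234.56", "1,234.56", "$1,234.56".
  const normalized = input.trim().replace(/^\$/, "").replace(/,/g, "");
  if (!/^\d+(\.\d{1,2})?$/.test(normalized)) {
    return { ok: false, error: `"${input}" is not a valid currency amount.` };
  }

  return { ok: true, value: Number(normalized) };
}
```

With this shape, validateCurrency(null) and validateCurrency("") come back as errors rather than exceptions, and a legitimate input like "1,234.56" is accepted instead of rejected, which is exactly the behavior the test found missing.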

Where do we go from here?

The author notes that Opus 4.5 might perform better in an agentic environment with human supervision, where you can iterate multiple times. But that’s the catch, isn’t it? If you need to send an AI back to the drawing board six or ten times to get working code, how much time are you actually saving?

I think the bigger question is whether any current AI model truly deserves the “best in the world” title for coding. Different models excel at different tasks, and real-world testing consistently shows that none are perfect. For developers following independent testing or checking hands-on demonstrations, the message is clear: verify claims with your own testing before committing to any AI coding tool for production work.
