At Greptile, we’re building an AI-powered code review bot that catches subtle bugs and anti-patterns directly in pull requests. The quality of our reviews depends heavily on the reasoning capabilities of the underlying LLMs.
Bug detection isn’t just harder than code generation—it’s fundamentally different. When OpenAI released the reasoning-optimized o3-mini last year, we saw signs that reasoning-first models could outperform traditional LLMs at spotting hard-to-catch issues.
Now, with the release of Claude Sonnet 4.0, we wanted to test whether Anthropic’s latest reasoning model delivers a measurable leap over its predecessor, Sonnet 3.7.
The Evaluation Dataset
To ensure broad coverage, we created a dataset spanning 16 domains and five languages (TypeScript, Ruby, Python, Go, Rust). Using Cursor, we generated ~210 self-contained programs and manually injected subtle, realistic bugs into each.
Our goal was to simulate the kind of issues that experienced developers might introduce—and that linters, tests, and even manual review could easily miss. Some examples of the programs we created:
ID | Program Description |
---|---|
1 | distributed microservices platform |
2 | event-driven simulation engine |
3 | containerized development environment manager |
19 | advanced encryption toolkit |
23 | smart home automation system |
24 | quantum computing simulation environment |
26 | climate modeling and simulation platform |
27 | advanced code generation ai |
30 | algorithmic trading platform |
31 | blockchain-based supply chain tracker |
41 | comprehensive personal assistant framework |
42 | multiplayer collaboration platform |
Next, we cycled through the programs and introduced a single subtle bug in each one. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs we introduced (one is sketched in code after this list):
- Undefined `response` variable in the ensure block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- Hard coded date which would be accurate in most, but not all situations
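To make the flavor of these bugs concrete, here is a minimal hypothetical sketch in Python, in the spirit of the hard-coded date example above. The function names and the leap-year framing are our own illustration, not code from the dataset:

```python
# Hypothetical illustration of a "correct in most, but not all,
# situations" bug: the simple divisible-by-4 leap-year rule.
def is_leap_year(year: int) -> bool:
    # Bug: this holds for every year from 1901 through 2099, so ordinary
    # tests pass, but century years like 1900 and 2100 are misclassified.
    return year % 4 == 0

def is_leap_year_fixed(year: int) -> bool:
    # Full Gregorian rule: divisible by 4, except century years,
    # unless also divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```

A linter sees nothing wrong here, a typical test suite never exercises a century year, and a reviewer skimming the diff has little reason to pause on it.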
At the end of this process, we had 209 programs, each containing a small, difficult-to-catch, realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs we could think of, and are not representative of the typical bug found in everyday software.
Results
Despite expectations, Claude Sonnet 4.0 did not outperform Sonnet 3.7 in our bug detection benchmark. Both models caught roughly 14-15% of the injected bugs (30/209 for Sonnet 4.0 versus 32/209 for Sonnet 3.7), with only minor variation across programming languages.
This result was surprising: Sonnet 4.0 is built on a newer architecture, yet its detection ability remained on par with 3.7, suggesting that this generation's improvements lie more in reasoning style than in raw detection accuracy.
Here’s the breakdown by language:
Language | Claude Sonnet 4.0 | Claude Sonnet 3.7 |
---|---|---|
Go | 6/42 | 6/42 |
Python | 3/42 | 4/42 |
TypeScript | 9/42 | 9/42 |
Rust | 6/41 | 6/41 |
Ruby | 6/42 | 7/42 |
Total | 30/209 | 32/209 |
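As a sanity check on the table, the totals and detection rates can be recomputed directly from the per-language rows:

```python
# Recompute totals and detection rates from the per-language results.
# Tuples are (Sonnet 4.0 catches, Sonnet 3.7 catches, programs tested).
results = {
    "Go":         (6, 6, 42),
    "Python":     (3, 4, 42),
    "TypeScript": (9, 9, 42),
    "Rust":       (6, 6, 41),
    "Ruby":       (6, 7, 42),
}
total_40 = sum(r[0] for r in results.values())    # 30
total_37 = sum(r[1] for r in results.values())    # 32
n_programs = sum(r[2] for r in results.values())  # 209
print(f"Sonnet 4.0: {total_40}/{n_programs} = {total_40 / n_programs:.1%}")  # 14.4%
print(f"Sonnet 3.7: {total_37}/{n_programs} = {total_37 / n_programs:.1%}")  # 15.3%
```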
While the totals were nearly identical, each model caught a slightly different subset of bugs. This suggests distinct internal heuristics or reasoning strategies, not just random variation. Below, we highlight one of those differences with a side-by-side example.
Key Insights and Significant Overlap
A notable observation was the substantial overlap in the bugs caught by both versions, pointing to strong consistency in how the two models reason about code. Still, there were intriguing distinctions that suggest future opportunities for optimization and improvement.
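One way to quantify that overlap is simple set arithmetic over the IDs of the bugs each model caught. The IDs below are invented for illustration, not our actual per-bug results:

```python
# Invented example IDs; the real per-bug results are not listed here.
caught_37 = {1, 19, 23, 30, 41, 42}  # bugs Sonnet 3.7 caught
caught_40 = {1, 19, 24, 30, 42}      # bugs Sonnet 4.0 caught

shared  = caught_37 & caught_40  # caught by both models
only_37 = caught_37 - caught_40  # caught only by 3.7
only_40 = caught_40 - caught_37  # caught only by 4.0
print(f"shared={len(shared)} only_3.7={len(only_37)} only_4.0={len(only_40)}")
```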
Example Highlight: Advanced Encryption Toolkit (Test 19)
Correct Bug Identified by Sonnet 3.7:
Claude Sonnet 3.7 accurately pinpointed:
“The most critical bug is in the `DataPartitioner.get_partition` method, where the ROUND_ROBIN strategy incorrectly returns a random partition number instead of implementing true round-robin distribution. A correct implementation requires a sequential counter across partitions.”
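A minimal reconstruction of that bug, assuming details the report doesn't specify (only the class and method names come from the quote; the bodies are our own sketch):

```python
import itertools
import random

class DataPartitioner:
    """Sketch of the partitioner named in the report; details are assumed."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._next = itertools.count()  # sequential state for round-robin

    def get_partition_buggy(self) -> int:
        # The injected bug: "round-robin" actually returns a random
        # partition, so assignment is uneven and non-deterministic.
        return random.randrange(self.num_partitions)

    def get_partition(self) -> int:
        # Correct round-robin: advance a shared counter so partitions
        # are assigned 0, 1, 2, ... and wrap around.
        return next(self._next) % self.num_partitions
```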
Alternative Issue Highlighted by Sonnet 4.0:
Claude Sonnet 4.0 focused instead on:
“The most critical bug is in the `SchemaValidator.validate` method, where `math.isnan()` is used without importing the `math` module earlier in the file, potentially causing a `NameError`.”
While Sonnet 4.0 identified a legitimate error, it wasn't the most critical issue in this context. Nonetheless, this demonstrates Sonnet 4.0's capacity to detect additional errors that may have indirect or secondary impacts.
This case highlights a key theme: Sonnet 3.7 often prioritized core logic bugs, while 4.0 surfaced secondary issues such as missing imports. Both findings are valid, but they signal differing attention strategies between the models.
Optimistic Outlook for Sonnet 4.0
Claude Sonnet 4.0 hasn't yet outpaced its predecessor, but its parity with 3.7 right out of the gate is promising. If this model is a foundation for future iterations, Anthropic seems well-positioned to deliver major improvements in bug detection and code reasoning down the line.
We’re optimistic that as reasoning-first models evolve, software verification will become increasingly automated, reliable, and scalable—allowing developers to focus on creativity and architecture instead of chasing down subtle issues.