Claude Sonnet 4 vs. Sonnet 3.7: Which Model Catches More Bugs?

Written by Everett Butler

May 22, 2025
Scratch Paper
Claude Sonnet 4 vs. Sonnet 3.7: Which Model Catches More Bugs?

At Greptile, we’re building an AI-powered code review bot that catches subtle bugs and anti-patterns directly in pull requests. The quality of our reviews depends heavily on the reasoning capabilities of the underlying LLMs.

Bug detection isn’t just harder than code generation—it’s fundamentally different. When OpenAI released the reasoning-optimized o3-mini last year, we saw signs that reasoning-first models could outperform traditional LLMs at spotting hard-to-catch issues.

Now, with the release of Claude Sonnet 4.0, we wanted to test whether Anthropic’s latest reasoning model delivers a measurable leap over its predecessor, Sonnet 3.7.

The Evaluation Dataset

To ensure broad coverage, we created a dataset spanning 16 domains and five languages (TypeScript, Ruby, Python, Go, Rust). Using Cursor, we generated ~210 self-contained programs and manually injected subtle, realistic bugs into each.

Our goal was to simulate the kind of issues that experienced developers might introduce—and that linters, tests, and even manual review could easily miss. Some examples of the programs we created:

IDProgram Description
1distributed microservices platform
2event-driven simulation engine
3containerized development environment manager
19advanced encryption toolkit
23smart home automation system
24quantum computing simulation environment
26climate modeling and simulation platform
27advanced code generation ai
30algorithmic trading platform
31blockchain-based supply chain tracker
41comprehensive personal assistant framework
42multiplayer collaboration platform

Next we cycled through and introduced a tiny bug in each one. The type of bug we chose to introduce had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs we introduced:

  1. Undefined `response` variable in the ensure block
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. Hard coded date which would be accurate in most, but not all situations

At the end of this, we had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs we could think of, and are not representative of the median bugs usually found in everyday software.

Results

Despite expectations, Claude Sonnet 4.0 did not outperform Sonnet 3.7 in our bug detection benchmark. Both models caught roughly 14% of injected bugs, with only minor variations across programming languages.

This result was surprising: Sonnet 4.0 is built on a newer architecture, yet its detection ability remained on par with 3.7—suggesting that improvements may lie more in reasoning style than raw accuracy at this stage.

Here’s the breakdown by language:

LanguageClaude Sonnet 4.0Claude Sonnet 3.7
Go6/426/42
Python3/424/42
TypeScript9/429/42
Rust6/416/41
Ruby6/427/42
Total30/20932/209

While the totals were nearly identical, each model caught a slightly different subset of bugs. This suggests distinct internal heuristics or reasoning strategies, not just random variation. In later sections, we highlight a few of those differences with side-by-side examples.

Key Insights and Significant Overlap

A notable observation was the substantial overlap in bugs caught by both versions, demonstrating a solid consistency in Anthropic's underlying AI framework. Still, there were intriguing distinctions that suggest future opportunities for optimization and improvement.

Example Highlight: Advanced Encryption Toolkit (Test 19)

Correct Bug Identified by Sonnet 3.7:

Claude Sonnet 3.7 accurately pinpointed:

"The most critical bug is in the DataPartitioner.get_partition method, where the ROUND_ROBIN strategy incorrectly returns a random partition number instead of implementing true round-robin distribution. A correct implementation requires a sequential counter across partitions."

Alternative Issue Highlighted by Sonnet 4.0:

Claude Sonnet 4.0 focused instead on:

"The most critical bug is in the SchemaValidator.validate method, where math.isnan() is used without importing the math module earlier in the file, potentially causing a NameError."

While Sonnet 4.0 identified a legitimate error, it wasn't the most critical issue in this context. Nonetheless, this demonstrates Sonnet 4.0's capacity to detect additional errors that may have indirect or secondary impacts.

This case highlights a key theme: Sonnet 3.7 often prioritized core logic bugs, while 4.0 surfaced secondary issues. Both are valid, but signal differing attention strategies between models.

Optimistic Outlook for Sonnet 4.0

Claude Sonnet 4.0 hasn't yet outpaced its predecessor, but its parity with 3.7 right out of the gate is promising. If this model is a foundation for future iterations, Anthropic seems well-positioned to deliver major improvements in bug detection and code reasoning down the line.

We’re optimistic that as reasoning-first models evolve, software verification will become increasingly automated, reliable, and scalable—allowing developers to focus on creativity and architecture instead of chasing down subtle issues.