Greptile AI Code Review Tool - Automated GitHub PR Review Bot with Full Codebase Understanding
OpenAI o3-mini vs Anthropic Sonnet 3.7 for Bug Detection

OpenAI o3-mini vs Anthropic Sonnet 3.7 for Bug Detection

April 28, 2025 (4d ago)

Written by Daksh Gupta

As AI code review tools become more widely adopted, a key question remains: how well can they catch real bugs — not just generate code, but spot logic flaws, edge cases, and race conditions?

In this post, we compare OpenAI’s o3-mini and Anthropic’s Sonnet 3.7, two compact but capable models. Both were tested head-to-head on a benchmark of hard-to-catch bugs spanning Python, TypeScript, Go, Rust, and Ruby.

🧪 The Evaluation Dataset

I wanted to have the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, picked 2-3 self-contained programs for each domain, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

Here are the programs we created for the evaluation:

ID

1distributed microservices platform
2event-driven simulation engine
3containerized development environment manager
4natural language processing toolkit
5predictive anomaly detection system
6decentralized voting platform
7smart contract development framework
8custom peer-to-peer network protocol
9real-time collaboration platform
10progressive web app framework
11webassembly compiler and runtime
12serverless orchestration platform
13procedural world generation engine
14ai-powered game testing framework
15multiplayer game networking engine
16big data processing framework
17real-time data visualization platform
18machine learning model monitoring system
19advanced encryption toolkit
20penetration testing automation framework
21iot device management platform
22edge computing framework
23smart home automation system
24quantum computing simulation environment
25bioinformatics analysis toolkit
26climate modeling and simulation platform
27advanced code generation ai
28automated code refactoring tool
29comprehensive developer productivity suite
30algorithmic trading platform
31blockchain-based supply chain tracker
32personal finance management ai
33advanced audio processing library
34immersive virtual reality development framework
35serverless computing optimizer
36distributed machine learning training framework
37robotic process automation rpa platform
38adaptive learning management system
39interactive coding education platform
40language learning ai tutor
41comprehensive personal assistant framework
42multiplayer collaboration platform

Next I cycled through and introduced a tiny bug in each one. The type of bug I chose to introduce had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. Undefined \response\ variable in the ensure block
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. Hard coded date which would be accurate in most, but not all situations

At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

📊 Results

Overall Bug Detection

  • OpenAI o3-mini: 37 bugs
  • Anthropic Sonnet 3.7: 32 bugs

While the total gap is modest, the language-level breakdown shows more nuance.

By Language

  • Python: o3-mini found 7, Sonnet 3.7 found 4
  • TypeScript: o3-mini found 7, Sonnet 3.7 found 9
  • Go: o3-mini found 7, Sonnet 3.7 found 6
  • Rust: o3-mini found 9, Sonnet 3.7 found 6
  • Ruby: Both models tied at 7

The two models are competitive. o3-mini leads slightly in Python and Rust. Sonnet 3.7 holds its own in TypeScript and Go, with Ruby coming out even.

🧠 Thoughts

Despite similar totals, OpenAI o3-mini outperformed Sonnet 3.7 overall — and was more consistent across all five languages.

That said, Sonnet 3.7 showed signs of stronger reasoning in specific edge cases, especially when dealing with concurrency and async behavior. Its performance in TypeScript — and competitive showing in Go — suggest that while it may not match o3-mini’s overall pattern recognition, it closes the gap through deeper logic handling.

This matches what we’ve seen in other benchmarks:

  • o3-mini shines in common languages with high training data coverage.
  • Sonnet 3.7 shows promise in languages and bug types where reasoning trumps memorization.

There’s no universal winner — just strengths that shift based on language, bug type, and model architecture.

🧩 A Bug Worth Highlighting

Go — Smart Home Race Condition

This bug appeared in a smart home notification system. A NotifyDeviceUpdate method broadcasted state changes — but failed to lock shared state first, creating a race condition.

  • Sonnet 3.7 caught it
  • o3-mini missed it

"The most critical bug is that there was no locking around device updates before broadcasting, which could lead to race conditions where clients receive stale or partially updated device state."

Sonnet’s output showed clear reasoning about concurrent access and mutation. It understood not just what the code did, but what could go wrong — and why that matters. This is the kind of logic bug that trips up models without reasoning steps.

✅ Final Thoughts

Both o3-mini and Sonnet 3.7 are capable bug detectors.

  • o3-mini performs better overall, especially in Python and Rust.
  • Sonnet 3.7 holds its own — and even outperforms in select cases — especially where bugs involve async logic or less-common control flow.

If you’re looking for the best all-around model today for AI code review, o3-mini is likely your top pick. But for teams tackling edge cases, concurrency, or lower-resource languages, Sonnet 3.7 brings unique value through deeper logic reasoning.


Greptile uses models like o3-mini and Sonnet 3.7 to catch real bugs in production PRs — before they hit prod. Want to see how it works on your codebase? Try Greptile — no credit card required.


TRY GREPTILE TODAY

AI code reviewer that understands your codebase.

Merge 50-80% faster, catch up to 3X more bugs.

14-days free, no credit card required