
OpenAI 4o-mini vs Sonnet 3.7: AI Bug Detection Compared

May 5, 2025

Written by Everett Butler

Introduction

Efforts to improve automated bug detection have recently turned toward Large Language Models (LLMs). LLMs have traditionally been associated with code generation, a task that differs significantly from bug detection, which demands deeper logical reasoning beyond pattern recognition. In this post, I compare two prominent AI models, OpenAI 4o-mini and Anthropic Sonnet 3.7, to assess how effectively they identify subtle software bugs across Python, TypeScript, Go, Rust, and Ruby.

The Evaluation Dataset

I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each domain, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

The full list of programs:

  1. distributed microservices platform
  2. event-driven simulation engine
  3. containerized development environment manager
  4. natural language processing toolkit
  5. predictive anomaly detection system
  6. decentralized voting platform
  7. smart contract development framework
  8. custom peer-to-peer network protocol
  9. real-time collaboration platform
  10. progressive web app framework
  11. webassembly compiler and runtime
  12. serverless orchestration platform
  13. procedural world generation engine
  14. ai-powered game testing framework
  15. multiplayer game networking engine
  16. big data processing framework
  17. real-time data visualization platform
  18. machine learning model monitoring system
  19. advanced encryption toolkit
  20. penetration testing automation framework
  21. iot device management platform
  22. edge computing framework
  23. smart home automation system
  24. quantum computing simulation environment
  25. bioinformatics analysis toolkit
  26. climate modeling and simulation platform
  27. advanced code generation ai
  28. automated code refactoring tool
  29. comprehensive developer productivity suite
  30. algorithmic trading platform
  31. blockchain-based supply chain tracker
  32. personal finance management ai
  33. advanced audio processing library
  34. immersive virtual reality development framework
  35. serverless computing optimizer
  36. distributed machine learning training framework
  37. robotic process automation rpa platform
  38. adaptive learning management system
  39. interactive coding education platform
  40. language learning ai tutor
  41. comprehensive personal assistant framework
  42. multiplayer collaboration platform
Next, I cycled through the programs and introduced a tiny bug into each one. The type of bug I chose to introduce had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. An undefined `response` variable referenced in an `ensure` block (see the Ruby sketch after this list)
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. Hard coded date which would be accurate in most, but not all situations
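
To make the first bug type concrete, here is a minimal Ruby sketch of the pattern. The helper name, URL handling, and logging line are illustrative assumptions, not code from the benchmark:

    require "net/http"
    require "uri"
    require "json"

    # Hypothetical helper showing bug example 1. If the request raises
    # before `response` is assigned, `response` is nil inside `ensure`,
    # and the logging line itself fails with a NoMethodError.
    def fetch_json(url)
      response = Net::HTTP.get_response(URI(url))
      JSON.parse(response.body)
    ensure
      # BUG: `response` may be nil here; a linter will not flag it and
      # happy-path tests still pass. A safe guard would be `response&.code`.
      puts "HTTP status: #{response.code}"
    end

This is exactly the kind of bug the criteria call for: syntactically valid, invisible on the happy path, and only triggered when an exception occurs mid-request.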

At the end of this, I had 210 programs (42 programs × 5 languages), each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

Results

Overall Performance:

  • Anthropic Sonnet 3.7: Detected 32 bugs out of 210.
  • OpenAI 4o-mini: Detected 20 bugs out of 210.

Performance by Programming Language:

  • Python: Both models performed equally, detecting 4 bugs each. The parity suggests strong pattern recognition capabilities in a widely-used language.
  • TypeScript: Anthropic Sonnet 3.7 significantly outperformed OpenAI, detecting 9 bugs compared to OpenAI’s 4. This indicates Anthropic's advantage in logical reasoning and handling TypeScript’s complexities.
  • Go: Anthropic Sonnet 3.7 detected twice as many bugs as OpenAI 4o-mini (6 vs 3), highlighting Sonnet's proficiency in handling concurrent and logical complexities inherent in Go.
  • Rust: Anthropic Sonnet 3.7 led with 6 bugs detected versus OpenAI's 4. Rust’s semantic nuances favor the reasoning capabilities of Sonnet’s model.
  • Ruby: Anthropic Sonnet 3.7 again came out ahead, identifying 7 bugs to OpenAI's 6, a narrower margin that still favors Sonnet's logical deduction in a dynamically-typed language.

Analysis and Key Insights

Anthropic Sonnet 3.7’s consistently superior performance—especially in TypeScript, Go, Rust, and Ruby—can largely be attributed to its built-in planning or "thinking" step. Unlike traditional pattern-based LLMs, Sonnet 3.7 explicitly engages in logical reasoning before generating responses. This approach significantly improves its ability to identify nuanced logical errors and subtle bugs in languages that may be less represented in training data.

Conversely, OpenAI 4o-mini’s comparable performance in Python emphasizes its strength in pattern recognition, driven by extensive training data and familiarity with common coding practices. While this approach proves efficient in well-documented environments, it falls short when facing languages requiring deeper logical evaluation.

The comparative results underscore the potential benefits of combining robust pattern recognition with logical reasoning capabilities to create more versatile and powerful bug detection tools.

Highlighted Bug Example

An illustrative example is a subtle logic error that only Anthropic Sonnet 3.7 caught, in a Ruby audio processing library:

Test Case: Ruby Bug #1 (Gain Calculation in Audio Processing)

  • Anthropic Sonnet 3.7 Output:

    "The bug in the TimeStretchProcessor class stems from an incorrect calculation of normalize_gain. It failed to account for dynamic adjustments relative to stretch_factor, resulting in incorrect audio amplitude outputs. By logically evaluating dependencies between variables, Sonnet 3.7 identified this critical oversight."

By evaluating the dependencies between variables, Sonnet 3.7 identified this oversight. The example highlights Anthropic's strength in logical reasoning, distinguishing correct from faulty implementations in a way that simpler, pattern-based methods often miss.
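
For readers who want to picture the failure mode, here is a minimal Ruby sketch consistent with that description. The class internals, the constant, and the exact form of the fix are assumptions, since the post does not show the benchmark code:

    class TimeStretchProcessor
      def initialize(stretch_factor)
        @stretch_factor = stretch_factor
      end

      # BUG (as described): the gain is computed as a constant, so it
      # cannot compensate for amplitude changes that depend on the
      # stretch factor.
      def normalize_gain
        0.5
      end

      # One plausible shape of the fix: make the gain track the stretch
      # factor so the stretched output keeps roughly the source amplitude.
      def corrected_normalize_gain
        0.5 / @stretch_factor
      end

      # Applies the (buggy) gain to every output sample.
      def apply_gain(samples)
        gain = normalize_gain
        samples.map { |s| s * gain }
      end
    end

Nothing here is syntactically wrong, and a unit test with stretch_factor = 1.0 would pass either way; catching the bug requires reasoning about how normalize_gain should co-vary with stretch_factor, which is precisely what the quoted output does.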

Conclusion

Our analysis indicates that while OpenAI 4o-mini performs well in pattern-rich, widely-used languages, Anthropic Sonnet 3.7 excels in languages requiring deeper logical analysis, thanks to its explicit reasoning capabilities. Moving forward, combining strong pattern recognition with explicit logical reasoning looks like the most promising path toward more versatile and reliable bug detection tools.


TRY GREPTILE TODAY

AI code reviewer that understands your codebase.

Merge 50-80% faster, catch up to 3X more bugs.

14 days free, no credit card required