
OpenAI 4o-mini vs Anthropic Sonnet 3.5: AI Bug Detection Compared

May 4, 2025

Written by Everett Butler

Introduction

Artificial intelligence continues to play an increasingly important role in software development, particularly in automated bug detection. Traditional debugging methods can be time-consuming and often miss subtle, complex issues. To explore AI’s capabilities further, this article compares two advanced AI models—OpenAI's 4o-mini and Anthropic's Sonnet 3.5—evaluating their effectiveness in identifying challenging bugs across Python, TypeScript, Go, Rust, and Ruby. Let's dive into the findings and insights from this comparison.

The Evaluation Dataset

I wanted the dataset of bugs to cover multiple domains and languages, so I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

1. distributed microservices platform
2. event-driven simulation engine
3. containerized development environment manager
4. natural language processing toolkit
5. predictive anomaly detection system
6. decentralized voting platform
7. smart contract development framework
8. custom peer-to-peer network protocol
9. real-time collaboration platform
10. progressive web app framework
11. webassembly compiler and runtime
12. serverless orchestration platform
13. procedural world generation engine
14. ai-powered game testing framework
15. multiplayer game networking engine
16. big data processing framework
17. real-time data visualization platform
18. machine learning model monitoring system
19. advanced encryption toolkit
20. penetration testing automation framework
21. iot device management platform
22. edge computing framework
23. smart home automation system
24. quantum computing simulation environment
25. bioinformatics analysis toolkit
26. climate modeling and simulation platform
27. advanced code generation ai
28. automated code refactoring tool
29. comprehensive developer productivity suite
30. algorithmic trading platform
31. blockchain-based supply chain tracker
32. personal finance management ai
33. advanced audio processing library
34. immersive virtual reality development framework
35. serverless computing optimizer
36. distributed machine learning training framework
37. robotic process automation rpa platform
38. adaptive learning management system
39. interactive coding education platform
40. language learning ai tutor
41. comprehensive personal assistant framework
42. multiplayer collaboration platform

Next, I cycled through the programs and introduced a tiny bug into each one. The type of bug had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. Undefined `response` variable in the ensure block (see the sketch after this list)
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. Hard coded date which would be accurate in most, but not all situations
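
To make the first of these concrete, here is a minimal Ruby sketch of that class of bug. The names (`FlakyConnection`, `fetch_status`) are invented for illustration and do not come from the evaluation programs; the point is that an `ensure` block runs even when the assignment it relies on never executed.

```ruby
# Hypothetical reconstruction of bug example 1 (names invented for
# illustration). If `get` raises, `response` is never assigned, so the
# ensure block sees nil and `response.close` raises NoMethodError,
# masking the original exception.
class FlakyConnection
  def get(_path)
    raise IOError, "connection reset"
  end
end

def fetch_status(conn)
  response = conn.get("/status") # raises before `response` is assigned
  response.body
ensure
  response.close # BUG: NoMethodError on nil; `response&.close` is safe
end

begin
  fetch_status(FlakyConnection.new)
rescue => e
  puts e.class # => NoMethodError, not the IOError that actually occurred
end
```

A linter sees a variable that is assigned and later used, and any test that exercises the happy path passes, which is what makes this class of bug easy to miss.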

At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

Results

Overall Performance

  • Anthropic Sonnet 3.5 successfully detected 26 of the 210 bugs.
  • OpenAI 4o-mini identified 19.

These results underline the difficulty of the task, but also the promising potential AI holds for enhancing software verification practices.

Performance by Programming Language

The results varied significantly across languages:

  • Go:

    • Anthropic Sonnet 3.5: 8 bugs detected out of 42.
    • OpenAI 4o-mini: 3 bugs detected.
    • Insight: Sonnet 3.5’s superior performance here suggests an advantage in logical reasoning capabilities, especially valuable in concurrency-heavy languages like Go.
  • Python:

    • Anthropic Sonnet 3.5: 3 bugs detected out of 42.
    • OpenAI 4o-mini: 4 bugs detected.
    • Insight: OpenAI slightly outperformed, possibly due to its strength in pattern recognition within well-documented languages like Python.
  • TypeScript:

    • Anthropic Sonnet 3.5: 5 bugs detected out of 42.
    • OpenAI 4o-mini: 2 bugs detected.
    • Insight: Sonnet 3.5’s advantage suggests its deeper reasoning capability excels in strongly typed and structurally complex languages.
  • Rust:

    • Anthropic Sonnet 3.5: 3 bugs detected out of 41.
    • OpenAI 4o-mini: 4 bugs detected.
    • Insight: Both models showed similar effectiveness, though OpenAI 4o-mini had a slight edge, possibly benefiting from Rust’s clearly defined patterns.
  • Ruby:

    • Anthropic Sonnet 3.5: 7 bugs detected out of 42.
    • OpenAI 4o-mini: 6 bugs detected.
    • Insight: Sonnet 3.5 showed notable strength, demonstrating its capacity for logical inference in dynamically typed environments.

Analysis and Key Insights

Anthropic Sonnet 3.5 generally outperformed OpenAI 4o-mini, particularly in languages with fewer standardized patterns or less abundant training data. This success can be attributed to Sonnet 3.5’s architectural emphasis on a reasoning phase before generating outputs, allowing it to interpret and logically deduce code behavior more effectively.

Conversely, OpenAI 4o-mini’s stronger performance in languages like Python and Rust highlights its reliance on rapid, pattern-based recognition, which works well with extensively documented, commonly encountered coding issues.

These differences underscore a crucial insight: integrating explicit reasoning processes into AI-driven bug detection can significantly enhance model performance, especially in contexts where mere pattern recognition is insufficient.

Highlighted Bug Example

An instructive example comes from a Ruby audio processing library: a subtle logic error in gain calculation that only Anthropic Sonnet 3.5 identified:

Test Case: Ruby Bug #1 (Gain Calculation Error)

  • Sonnet 3.5 Reasoning Output:

    "The bug in this file is in the TimeStretchProcessor class, specifically how it calculates normalize_gain. It incorrectly uses a fixed formula without considering the stretch_factor. This oversight causes audio outputs to have incorrect amplitude levels. By logically reasoning through the relationship between the stretch_factor and gain adjustments, Sonnet 3.5 correctly identified this inconsistency."

This example emphasizes how Sonnet 3.5's reasoning capability lets it catch logical errors that simple syntactic or pattern-based checks miss: it worked through the relationship between the stretch_factor and the gain adjustment to find the inconsistency.

Conclusion

The comparative analysis illustrates the strengths and weaknesses of each model, highlighting Anthropic Sonnet 3.5’s impressive reasoning-based bug detection capabilities, especially valuable in less mainstream programming languages. As AI-driven code analysis evolves, integrating reasoning steps within traditional pattern-based architectures could significantly advance software verification practices, enhancing both reliability and developer productivity.


Want to improve your software quality using advanced AI-driven bug detection? Try Greptile today.

