
OpenAI o4-mini vs Sonnet 3.5: AI Bug Detection Compared

May 5, 2025

Written by Everett Butler

Introduction

As software complexity continues to increase, effective bug detection has become a critical aspect of software reliability. AI-driven tools such as OpenAI's o4-mini and Anthropic's Sonnet 3.5 are at the forefront of automating the identification of intricate bugs. In this article, I'll present a comparative analysis of these two leading AI models, assessing their performance at detecting complex bugs across Python, TypeScript, Go, Rust, and Ruby.

The Evaluation Dataset

I wanted the dataset of bugs to cover multiple domains and languages, so I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

  1. distributed microservices platform
  2. event-driven simulation engine
  3. containerized development environment manager
  4. natural language processing toolkit
  5. predictive anomaly detection system
  6. decentralized voting platform
  7. smart contract development framework
  8. custom peer-to-peer network protocol
  9. real-time collaboration platform
  10. progressive web app framework
  11. webassembly compiler and runtime
  12. serverless orchestration platform
  13. procedural world generation engine
  14. ai-powered game testing framework
  15. multiplayer game networking engine
  16. big data processing framework
  17. real-time data visualization platform
  18. machine learning model monitoring system
  19. advanced encryption toolkit
  20. penetration testing automation framework
  21. iot device management platform
  22. edge computing framework
  23. smart home automation system
  24. quantum computing simulation environment
  25. bioinformatics analysis toolkit
  26. climate modeling and simulation platform
  27. advanced code generation ai
  28. automated code refactoring tool
  29. comprehensive developer productivity suite
  30. algorithmic trading platform
  31. blockchain-based supply chain tracker
  32. personal finance management ai
  33. advanced audio processing library
  34. immersive virtual reality development framework
  35. serverless computing optimizer
  36. distributed machine learning training framework
  37. robotic process automation rpa platform
  38. adaptive learning management system
  39. interactive coding education platform
  40. language learning ai tutor
  41. comprehensive personal assistant framework
  42. multiplayer collaboration platform
Next, I cycled through the programs and introduced a tiny bug into each one. Every bug had to meet two criteria:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. An undefined `response` variable in an ensure block (a sketch follows this list)
  2. Failing to account for amplitude normalization when computing wave stretching on a sound sample
  3. A hard-coded date that would be accurate in most, but not all, situations
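
To make the first example concrete: the original bug lives in a Ruby ensure block, but the same mistake translates directly to Python's try/finally. Here is a minimal, hypothetical sketch; the URL-fetching scenario is illustrative, not one of the evaluation programs:

```python
import urllib.request

def fetch_page(url: str) -> bytes:
    try:
        response = urllib.request.urlopen(url)  # may raise before `response` is bound
        return response.read()
    finally:
        # BUG: if urlopen() raises (bad URL, network failure), `response` was
        # never assigned, so this line raises UnboundLocalError and masks the
        # original exception. Initializing `response = None` before the try
        # and closing only when it is not None fixes it.
        response.close()
```

The happy path works perfectly and many linters stay quiet, which is exactly what makes this class of bug hard to catch.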

At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

Results

Overall Bug Detection:

  • Anthropic Sonnet 3.5 detected 26 bugs.
  • OpenAI o4-mini identified 15 bugs.

Performance by Programming Language

Performance varied significantly across different languages:

  • Python: OpenAI o4-mini performed slightly better, detecting 5 bugs to Sonnet 3.5's 3, suggesting OpenAI's strength in pattern recognition for well-documented languages.
  • TypeScript: Sonnet 3.5 significantly outperformed o4-mini, finding 5 bugs versus 2, highlighting Sonnet's effective logical reasoning in strongly typed languages.
  • Go: Sonnet 3.5 was notably stronger, identifying 8 bugs to o4-mini's 1, a significant advantage in detecting subtle concurrency and logical errors.
  • Rust: Both models were evenly matched at 3 bugs each, suggesting both face challenges with Rust's complex type safety and ownership semantics.
  • Ruby: Sonnet 3.5 again came out clearly ahead, detecting 7 bugs to o4-mini's 4, confirming its strength in dynamically typed languages.

Analysis and Key Insights

The superior performance of Anthropic Sonnet 3.5 can largely be attributed to its reasoning-based architecture. Unlike OpenAI o4-mini, which relies more heavily on heuristic and pattern-based predictions, Sonnet 3.5 explicitly incorporates logical reasoning steps. This capability is especially valuable in languages with fewer available training patterns, such as Ruby and Go, where nuanced logical errors are more frequent and harder to detect through pattern matching alone.

In contrast, OpenAI o4-mini’s slightly better performance in Python indicates strengths in environments rich in training data, highlighting its capacity for rapid pattern recognition when dealing with common or widely recognized coding issues.

Highlighted Bug Example

One particularly insightful bug involved an incorrect implementation of a round-robin strategy in a Python-based data partitioning module (DataPartitioner). The bug arose from using a random distribution instead of a sequential approach:

Bug Description:
The get_partition method in the DataPartitioner class incorrectly used random.randint() rather than following the round-robin distribution logic, resulting in records being assigned to partitions at random instead of in a sequential, evenly balanced cycle.
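
Here is a minimal sketch of how such a bug might look. The DataPartitioner class and get_partition method are named above; everything else (the constructor shape, the cursor field, the fixed variant) is assumed for illustration:

```python
import random

class DataPartitioner:
    """Assigns incoming records to one of N partitions (simplified sketch)."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._cursor = 0  # hypothetical cursor for round-robin assignment

    def get_partition(self, record) -> int:
        # BUG: random assignment instead of round-robin. Every index returned
        # is still valid, and the distribution is uniform in expectation, so
        # range checks and most unit tests pass, but assignment is
        # nondeterministic and the sequential-cycling guarantee is lost.
        return random.randint(0, self.num_partitions - 1)

    def get_partition_fixed(self, record) -> int:
        # FIX: cycle through partitions sequentially.
        partition = self._cursor
        self._cursor = (self._cursor + 1) % self.num_partitions
        return partition
```

Because the buggy version still returns valid partition indices, tests that only assert ranges or aggregate counts will usually pass; only the ordering guarantee is silently violated, which is what makes this bug representative of the dataset.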

