Identifying Hard Bugs: OpenAI o4-mini vs. Anthropic Sonnet 3.7

May 1, 2025

Written by Everett Butler

Introduction

Large Language Models (LLMs) have shown significant promise in software verification, particularly in identifying subtle bugs that might be overlooked during manual code review. In this study, we compare two leading AI models—OpenAI's o4-mini and Anthropic's Sonnet 3.7—to evaluate their effectiveness at detecting intricate software bugs across Python, TypeScript, Go, Rust, and Ruby. This evaluation aims to highlight their comparative strengths and pinpoint areas for further improvement.

The Evaluation Dataset

I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each domain, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

  1. distributed microservices platform
  2. event-driven simulation engine
  3. containerized development environment manager
  4. natural language processing toolkit
  5. predictive anomaly detection system
  6. decentralized voting platform
  7. smart contract development framework
  8. custom peer-to-peer network protocol
  9. real-time collaboration platform
  10. progressive web app framework
  11. webassembly compiler and runtime
  12. serverless orchestration platform
  13. procedural world generation engine
  14. ai-powered game testing framework
  15. multiplayer game networking engine
  16. big data processing framework
  17. real-time data visualization platform
  18. machine learning model monitoring system
  19. advanced encryption toolkit
  20. penetration testing automation framework
  21. iot device management platform
  22. edge computing framework
  23. smart home automation system
  24. quantum computing simulation environment
  25. bioinformatics analysis toolkit
  26. climate modeling and simulation platform
  27. advanced code generation ai
  28. automated code refactoring tool
  29. comprehensive developer productivity suite
  30. algorithmic trading platform
  31. blockchain-based supply chain tracker
  32. personal finance management ai
  33. advanced audio processing library
  34. immersive virtual reality development framework
  35. serverless computing optimizer
  36. distributed machine learning training framework
  37. robotic process automation (rpa) platform
  38. adaptive learning management system
  39. interactive coding education platform
  40. language learning ai tutor
  41. comprehensive personal assistant framework
  42. multiplayer collaboration platform

Next, I cycled through the programs and introduced a tiny bug into each one. Each bug had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced:

  1. An undefined `response` variable referenced in an `ensure` block
  2. Failing to account for amplitude normalization when computing wave stretching on a sound sample
  3. A hard-coded date that is accurate in most, but not all, situations (sketched below)
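
To make the third example concrete, here is a minimal Go sketch of that kind of hard-coded date assumption. It is illustrative only, with hypothetical names, and is not one of the benchmark programs: the function assumes every year has 365 days, which is correct in most years and silently wrong in leap years.

```go
package main

import (
	"fmt"
	"time"
)

// daysInYear returns the number of days in the year containing t.
// BUG (sketch): 365 is hard-coded, which is accurate in most years
// but wrong in leap years such as 2024.
func daysInYear(t time.Time) int {
	return 365
}

// daysInYearFromCalendar derives the same value from the calendar,
// so leap years come out correctly.
func daysInYearFromCalendar(t time.Time) int {
	start := time.Date(t.Year(), time.January, 1, 0, 0, 0, 0, time.UTC)
	end := start.AddDate(1, 0, 0)
	return int(end.Sub(start).Hours() / 24)
}

func main() {
	leap := time.Date(2024, time.June, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(daysInYear(leap), daysInYearFromCalendar(leap)) // 365 366
}
```

A linter won’t flag the constant and tests written against a non-leap year pass, which is exactly what makes this class of bug hard to catch.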

At the end of this, I had 210 programs (42 in each of the five languages), each with a small, difficult-to-catch, and realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

Results

Overall Bug Detection:

  • Anthropic Sonnet 3.7: 32 of the 210 bugs detected
  • OpenAI o4-mini: 15 of the 210 bugs detected

Performance by Programming Language

  • Python: Both models showed similar effectiveness, with OpenAI o4-mini detecting 5 bugs, slightly ahead of Sonnet 3.7’s 4 detections. This suggests OpenAI's strength in pattern recognition within a well-represented language.
  • TypeScript: Anthropic Sonnet 3.7 clearly excelled, detecting 9 bugs compared to OpenAI o4-mini’s 2, reflecting Sonnet’s superior logical reasoning capabilities in strongly typed environments.
  • Go: Sonnet 3.7 demonstrated a strong advantage, identifying 6 bugs versus OpenAI's single detection. This highlights Sonnet’s better handling of concurrent logic and synchronization issues.
  • Rust: Anthropic Sonnet 3.7 again led with 6 bugs detected, double the amount detected by OpenAI o4-mini (3), indicating Sonnet's capability in understanding Rust’s nuanced semantics.
  • Ruby: Sonnet 3.7 outperformed o4-mini, detecting 7 bugs compared to o4-mini’s 4, showcasing its reasoning strength in dynamically typed languages.

Analysis and Insights

Anthropic Sonnet 3.7’s overall stronger performance can primarily be attributed to its structured planning and reasoning phase, allowing it to logically analyze complex code scenarios. This reasoning capability is particularly valuable in languages like Ruby and Go, where patterns might be less obvious and less frequently represented in training datasets.

On the other hand, OpenAI o4-mini’s comparative success in Python underscores its effective use of pattern matching within extensively trained, mainstream programming languages. However, its limited reasoning capabilities become evident in languages requiring deeper logical insight.

The results emphasize the importance of combining extensive language-specific training data with robust reasoning and logical analysis capabilities, particularly when designing AI models for automated bug detection.

Highlighted Bug Example

A particularly insightful case involved a subtle race condition in a Go-based smart home notification system:

Issue Description:
The critical bug was due to missing synchronization mechanisms around device updates in the notification broadcasting function. Without appropriate locking, concurrent updates could cause clients to receive outdated or partially updated device information.
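
The benchmark program itself isn’t reproduced here, but a minimal Go sketch of the pattern, using hypothetical names rather than the actual code, looks roughly like this: the broadcasting path reads and mutates shared device state without taking the mutex that is meant to guard it.

```go
package notifications

import "sync"

// Device holds the last known state of a smart home device.
type Device struct {
	ID     string
	Status string
}

// Hub fans device updates out to subscribed clients.
type Hub struct {
	mu      sync.Mutex // intended to guard devices
	devices map[string]*Device
	clients []chan Device
}

// UpdateAndBroadcast applies an update and notifies every client.
// BUG (sketch): devices is read and mutated without holding h.mu,
// so concurrent updates race with each other, and clients can be
// sent stale or partially updated device state.
func (h *Hub) UpdateAndBroadcast(id, status string) {
	dev, ok := h.devices[id] // unsynchronized map read
	if !ok {
		dev = &Device{ID: id}
		h.devices[id] = dev // unsynchronized map write
	}
	dev.Status = status // racy mutation of shared state

	for _, c := range h.clients {
		c <- *dev // snapshot may mix old and new values
	}
}
```

Nothing here fails to compile or breaks an ordinary unit test; the problem only shows up under concurrent load (or with Go’s race detector), which is why it is easy to miss in review.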

Anthropic Sonnet 3.7 Reasoning:

"The most critical bug in this code is the absence of locking around device updates prior to broadcasting notifications, which creates potential race conditions. Clients might consequently receive stale or partially updated states."

Sonnet 3.7 successfully identified this subtle logic error, highlighting the advantage of its reasoning-based architecture. OpenAI o4-mini, lacking such detailed logical analysis, failed to detect this particular bug.
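
A fix in the spirit of Sonnet 3.7’s finding, again only a sketch that reuses the hypothetical `Hub` type from above, takes the lock around the shared-state update and broadcasts a copy captured while the lock is held:

```go
// UpdateAndBroadcast, corrected: every access to h.devices happens
// under h.mu, and clients receive a consistent snapshot.
func (h *Hub) UpdateAndBroadcast(id, status string) {
	h.mu.Lock()
	dev, ok := h.devices[id]
	if !ok {
		dev = &Device{ID: id}
		h.devices[id] = dev
	}
	dev.Status = status
	snapshot := *dev // copy taken while the lock is held
	h.mu.Unlock()

	for _, c := range h.clients {
		c <- snapshot // clients never see a half-applied update
	}
}
```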

Conclusion

This comparative study underscores Anthropic Sonnet 3.7’s notable strengths in identifying intricate software bugs through logical reasoning. While OpenAI o4-mini performs effectively in pattern-heavy contexts like Python, Sonnet 3.7’s broader reasoning capabilities significantly enhance its effectiveness across less common and more complex coding environments.

Future advancements in LLM-based bug detection will likely depend on balancing extensive pattern-recognition training with robust logical reasoning processes, ultimately enhancing software quality and developer productivity.


Interested in AI-powered bug detection and improved code quality? Try Greptile today.

