OpenAI 4o vs Anthropic Sonnet 3.5: AI Models Compared on Bug Detection

May 1, 2025

Written by Everett Butler

Introduction

Identifying subtle, complex bugs remains a persistent challenge in software development. AI-powered code review tools have recently emerged as promising solutions, potentially revolutionizing how developers detect and resolve tricky software errors. In this post, I'll compare two leading AI models—OpenAI 4o and Anthropic Sonnet 3.5—to determine which performs better at detecting hard-to-find software bugs across Python, Go, TypeScript, Rust, and Ruby.

The Evaluation Dataset

I wanted the dataset of bugs to cover multiple domains and languages, so I picked sixteen domains, chose 2-3 self-contained programs for each domain, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.

ID  Program
1   distributed microservices platform
2   event-driven simulation engine
3   containerized development environment manager
4   natural language processing toolkit
5   predictive anomaly detection system
6   decentralized voting platform
7   smart contract development framework
8   custom peer-to-peer network protocol
9   real-time collaboration platform
10  progressive web app framework
11  webassembly compiler and runtime
12  serverless orchestration platform
13  procedural world generation engine
14  ai-powered game testing framework
15  multiplayer game networking engine
16  big data processing framework
17  real-time data visualization platform
18  machine learning model monitoring system
19  advanced encryption toolkit
20  penetration testing automation framework
21  iot device management platform
22  edge computing framework
23  smart home automation system
24  quantum computing simulation environment
25  bioinformatics analysis toolkit
26  climate modeling and simulation platform
27  advanced code generation ai
28  automated code refactoring tool
29  comprehensive developer productivity suite
30  algorithmic trading platform
31  blockchain-based supply chain tracker
32  personal finance management ai
33  advanced audio processing library
34  immersive virtual reality development framework
35  serverless computing optimizer
36  distributed machine learning training framework
37  robotic process automation (rpa) platform
38  adaptive learning management system
39  interactive coding education platform
40  language learning ai tutor
41  comprehensive personal assistant framework
42  multiplayer collaboration platform

Next, I cycled through the programs and introduced a tiny bug into each one. Each bug had to be:

  1. A bug that a professional developer could reasonably introduce
  2. A bug that could easily slip through linters, tests, and manual code review

Some examples of bugs I introduced (a Ruby sketch of the first follows the list):

  1. An undefined `response` variable in an ensure block
  2. Not accounting for amplitude normalization when computing wave stretching on a sound sample
  3. A hard-coded date that would be accurate in most, but not all, situations
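
To make the first of these concrete, here is a minimal Ruby sketch of one plausible reading of that bug; the method and variable names are hypothetical, not taken from the actual dataset:

```ruby
# Hypothetical example. On the happy path `response` is assigned and the
# ensure block closes it, so tests pass. But if client.open itself raises,
# the assignment never executes: `response` evaluates to nil inside the
# ensure block, and the unguarded call raises NoMethodError, masking the
# original exception.
def fetch_status(client)
  response = client.open("/status")
  response.read
ensure
  # BUG: needs a nil guard (`response&.close`) -- `response` is nil here
  # whenever client.open failed before the assignment ran.
  response.close
end
```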

At the end of this, I had 210 programs (42 programs in each of the 5 languages), each with a small, difficult-to-catch, but realistic bug.

A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.

Results

Anthropic Sonnet 3.5 outperformed OpenAI 4o, successfully identifying 26 of the 210 bugs, compared to the 20 caught by OpenAI 4o.

Performance by Language

  • Go: Anthropic Sonnet 3.5 detected twice as many bugs as OpenAI 4o (8 vs. 4 out of 42). Its reasoning capability likely helped identify subtle concurrency and synchronization issues in Go.

  • Python: OpenAI 4o performed better, catching 6 bugs compared to Sonnet 3.5's 3. Python’s extensive training data and familiar patterns likely benefited OpenAI’s pattern-matching approach.

  • TypeScript: Performance was similar, with Anthropic Sonnet 3.5 finding 5 bugs, narrowly outperforming OpenAI 4o, which found 4. This reflects comparable pattern recognition and reasoning capabilities in strongly typed languages.

  • Rust: Both models performed equally, each detecting 3 of the 42 bugs. Rust’s structured and safety-oriented code may suit pattern-based and reasoning approaches equally well.

  • Ruby: Anthropic Sonnet 3.5 significantly outperformed OpenAI 4o, identifying 7 bugs compared to just 3 by OpenAI 4o. Ruby’s dynamic typing and complex logic flow favored Anthropic’s reasoning-focused architecture.

Analysis and Insights

The differences between OpenAI 4o and Anthropic Sonnet 3.5 underscore how varied AI architectures and training methods influence bug detection performance. Sonnet 3.5's reasoning capabilities excelled in languages with less straightforward pattern matching or less training data (like Ruby and Go), indicating that logical inference can significantly enhance bug detection in certain contexts.

Conversely, OpenAI 4o’s strength in Python emphasizes how extensive training datasets and pattern recognition are advantageous for widely used languages.

These insights suggest the future of AI bug detection tools lies in effectively combining both pattern-recognition and reasoning-based approaches, adapting strategies according to language specifics and development contexts.

Highlighted Bug Example

One particularly illustrative example involved a subtle issue in a Ruby-based audio processing library, identified only by Anthropic Sonnet 3.5:

Issue Description:
The bug was found in the TimeStretchProcessor class, specifically the calculation of normalize_gain. The original implementation mistakenly used a fixed formula rather than adjusting dynamically based on the stretch_factor. This caused output audio to have incorrect amplitude—either too loud or too quiet depending on the stretch applied.
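
The class and formula below are a hypothetical reconstruction (the original source isn't reproduced in this post), but they sketch the shape of the flaw: a gain constant where a function of `stretch_factor` belongs.

```ruby
# Hypothetical reconstruction of the TimeStretchProcessor flaw.
class TimeStretchProcessor
  def initialize(stretch_factor)
    @stretch_factor = stretch_factor
  end

  # BUG: a fixed gain, independent of @stretch_factor, so the output is
  # too loud or too quiet depending on the stretch applied.
  def normalize_gain
    2.0
  end

  # A stretch-aware version would scale with the factor, for example:
  #   def normalize_gain
  #     2.0 / @stretch_factor
  #   end

  # Apply the gain to every sample in the buffer.
  def process(samples)
    gain = normalize_gain
    samples.map { |sample| sample * gain }
  end
end
```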

Anthropic Sonnet 3.5 logically reasoned through the implications of the audio amplitude and correctly identified the issue. OpenAI 4o, relying more heavily on pattern recognition, missed this nuanced logical flaw.

Final Thoughts

The comparative analysis highlights the complementary strengths of pattern-based and reasoning-based AI models in automated software verification. Understanding these differences helps set clearer expectations and informs future improvements in AI-driven bug detection tools, ultimately supporting developers in producing more reliable, robust software.

Interested in improving your team's code reviews with AI? Try Greptile for free today.

