
OpenAI o1 vs Anthropic Sonnet 3.5: Which AI is Better at Bug Detection?

April 24, 2025

Written by Everett Butler

Ensuring code robustness and catching elusive bugs before deployment is becoming increasingly challenging as software complexity grows. At Greptile, we leverage AI-driven code review to pinpoint subtle logical flaws and anomalies traditional tools might overlook.

Recently, I conducted a rigorous evaluation of two prominent large language models—OpenAI o1 and Anthropic Sonnet 3.5—to gauge their effectiveness at uncovering challenging bugs. Detecting these issues requires more than syntax checking; it demands deep logic comprehension, reasoning about concurrency, and nuanced understanding of language-specific complexities.

Evaluation Setup

To assess each model's capabilities, I constructed a dataset of 210 difficult-to-detect bugs, distributed nearly evenly across five popular programming languages:

  • Python
  • TypeScript
  • Go
  • Rust
  • Ruby

Each bug was deliberately subtle and realistic, designed specifically to evade standard linters, automated tests, and casual manual reviews.
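To make "subtle and realistic" concrete, here is a hypothetical bug of the kind the dataset targets (my own illustration, not an actual test case): an integer division that linters and type checkers happily accept but that silently truncates results.

```ruby
# Hypothetical illustration of a linter-proof bug (not from the dataset):
# averaging integer samples with integer division truncates the result.
def average_latency(samples)
  samples.sum / samples.size        # bug: [1, 2] => 1, not 1.5
end

def average_latency_fixed(samples)
  samples.sum.to_f / samples.size   # fix: promote to Float before dividing
end
```

No linter flags the first version: it is syntactically valid and perfectly idiomatic. Only reasoning about intent (an average should not be truncated) reveals the bug.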

Test Projects

  1. distributed microservices platform
  2. event-driven simulation engine
  3. containerized development environment manager
  4. natural language processing toolkit
  5. predictive anomaly detection system
  6. decentralized voting platform
  7. smart contract development framework
  8. custom peer-to-peer network protocol
  9. real-time collaboration platform
  10. progressive web app framework
  11. webassembly compiler and runtime
  12. serverless orchestration platform
  13. procedural world generation engine
  14. ai-powered game testing framework
  15. multiplayer game networking engine
  16. big data processing framework
  17. real-time data visualization platform
  18. machine learning model monitoring system
  19. advanced encryption toolkit
  20. penetration testing automation framework
  21. iot device management platform
  22. edge computing framework
  23. smart home automation system
  24. quantum computing simulation environment
  25. bioinformatics analysis toolkit
  26. climate modeling and simulation platform
  27. advanced code generation ai
  28. automated code refactoring tool
  29. comprehensive developer productivity suite
  30. algorithmic trading platform
  31. blockchain-based supply chain tracker
  32. personal finance management ai
  33. advanced audio processing library
  34. immersive virtual reality development framework
  35. serverless computing optimizer
  36. distributed machine learning training framework
  37. robotic process automation (rpa) platform
  38. adaptive learning management system
  39. interactive coding education platform
  40. language learning ai tutor
  41. comprehensive personal assistant framework
  42. multiplayer collaboration platform

Results

Overall Performance

Across all 210 test cases, Anthropic Sonnet 3.5 came out clearly ahead:

  • Anthropic Sonnet 3.5 identified 26 of 210 bugs (a 12.4% detection rate).
  • OpenAI o1 identified 15 of 210 bugs (7.1%).

While both absolute detection rates remain low, the gap points to Sonnet 3.5's advantage, likely due to its embedded reasoning capability.

Performance Breakdown by Language

Here's a closer look at how each model performed in each language:

Go

  • Anthropic Sonnet 3.5: 8/42 bugs detected
  • OpenAI o1: 2/42 bugs detected

Anthropic Sonnet 3.5 notably excelled here, likely benefiting from its reasoning capability to handle Go’s concurrency-heavy architecture.
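The archetypal concurrency bug is an unsynchronized read-modify-write. Since the actual Go test cases aren't reproduced here, the sketch below shows the same pattern in Ruby (the language used for examples in this post): a shared counter that only counts correctly when the increment is guarded.

```ruby
# Check-then-act race, the classic concurrency bug shape: `counter += 1`
# is a read, an add, and a write, so interleaved threads can lose updates.
counter = 0
lock = Mutex.new

threads = 4.times.map do
  Thread.new do
    1_000.times do
      # Without the Mutex, this increment is not atomic and counts get lost.
      lock.synchronize { counter += 1 }
    end
  end
end
threads.each(&:join)
```

Bugs of this shape pass every single-threaded test and only surface under load, which is exactly why they make good material for evaluating a reviewer's reasoning.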

Python

  • Anthropic Sonnet 3.5: 3/42 bugs detected
  • OpenAI o1: 2/42 bugs detected

Both models struggled in Python, though Sonnet 3.5 edged slightly ahead.

TypeScript

  • Anthropic Sonnet 3.5: 5/42 bugs detected
  • OpenAI o1: 4/42 bugs detected

Performance was closely matched in TypeScript, with Sonnet 3.5 slightly outperforming o1.

Rust

  • Anthropic Sonnet 3.5: 3/41 bugs detected
  • OpenAI o1: 3/41 bugs detected

The models were evenly matched for Rust, reflecting the inherent complexity of Rust’s error patterns and systems-level constructs.

Ruby

  • Anthropic Sonnet 3.5: 7/42 bugs detected
  • OpenAI o1: 4/42 bugs detected

Sonnet 3.5 outperformed o1 significantly here, showcasing the benefit of reasoning capabilities for Ruby's dynamic, nuanced environment.

Why Did Sonnet 3.5 Perform Better?

Anthropic Sonnet 3.5's superior results can largely be attributed to its integrated reasoning step, enabling the model to logically explore potential errors before generating its output. Unlike models primarily relying on pattern recognition—such as OpenAI o1—this reasoning process allows Sonnet 3.5 to more effectively identify subtle logical issues, particularly in less common or complex language environments.

While pattern-matching is sufficient for languages with extensive training datasets (like Python or TypeScript), languages with fewer training examples—such as Go and Ruby—benefit greatly from a model that systematically evaluates logic and intent.

Highlighted Bug Example: Audio Gain Calculation (Ruby)

One particularly insightful example (Test #1) highlights Sonnet 3.5’s advantage clearly:

  • Anthropic Sonnet 3.5’s reasoning:
    "The bug is in the TimeStretchProcessor class of a Ruby audio processing library, specifically within the calculation of normalize_gain. The current implementation uses a fixed formula rather than adjusting the gain based on the stretch_factor—the value representing how much audio is sped up or slowed down. This causes incorrect amplitude outputs, either too loud or too quiet depending on the stretch applied. The correct implementation should scale the gain proportionally to the stretch factor."

Anthropic Sonnet 3.5 detected this nuanced logic flaw by assessing the intended algorithmic behavior against actual implementation. OpenAI o1 missed this subtlety entirely, highlighting Sonnet 3.5’s capability to perform deeper logical reasoning—crucial for catching sophisticated bugs.
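From the quoted reasoning, the shape of the bug might look like the sketch below. The class and method names (TimeStretchProcessor, normalize_gain, stretch_factor) come from Sonnet 3.5's description; the exact formulas are my reconstruction, not the actual test case.

```ruby
class TimeStretchProcessor
  attr_reader :stretch_factor

  def initialize(stretch_factor)
    @stretch_factor = stretch_factor  # e.g. 2.0 = half speed, 0.5 = double speed
  end

  # Buggy shape: a fixed gain that ignores how much the audio was stretched,
  # so output amplitude is wrong whenever stretch_factor != 1.
  def normalize_gain_buggy
    1.0
  end

  # Fixed shape (assumed): gain scales inversely with the stretch factor,
  # attenuating slowed-down audio proportionally.
  def normalize_gain
    1.0 / stretch_factor
  end
end
```

The buggy version is well-formed code that no linter or test of default settings would catch; spotting it requires comparing the implementation against the algorithm's intent, which is precisely the kind of check a reasoning step enables.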

Final Thoughts

This evaluation illustrates that while both OpenAI o1 and Anthropic Sonnet 3.5 have strengths, the added reasoning capabilities of Sonnet 3.5 deliver significant practical benefits in real-world bug detection. As software systems continue growing in complexity, reasoning-enhanced AI models promise to become essential tools for developers aiming to maintain robust, error-free codebases.


TRY GREPTILE TODAY

AI code reviewer that understands your codebase.

Merge 50-80% faster, catch up to 3X more bugs.

14 days free, no credit card required