OpenAI o1-mini vs Anthropic Sonnet 3.5: AI Models Compared on Hard Bug Detection

April 26, 2025

Written by Everett Butler

Effective bug detection in software development is critical, and the role of AI-powered tools has never been more important. At Greptile, we leverage AI-driven code reviews to uncover subtle yet serious bugs that traditional approaches can overlook.

In this blog post, I compare two advanced AI language models: OpenAI o1-mini and Anthropic Sonnet 3.5, evaluating their capabilities in identifying hard-to-detect software bugs. Unlike code generation, bug detection requires deep logical reasoning in addition to robust pattern recognition—making this comparison particularly insightful.

Evaluation Setup

To thoroughly assess each model, I introduced 210 challenging, realistic bugs distributed evenly across five popular programming languages:

  • Python
  • TypeScript
  • Go
  • Rust
  • Ruby

Each bug was carefully chosen to reflect subtle errors that experienced developers might unintentionally introduce, often slipping through standard automated tests, linters, and manual code reviews.
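The seeded bugs themselves are not reproduced in this post, but as a hypothetical illustration of the category, here is the kind of off-by-one that compiles cleanly, passes linters, and survives a shallow test suite:

```python
def binary_search(a, target):
    """Return an index of target in sorted list a, or -1 if absent."""
    lo, hi = 0, len(a) - 1
    while lo < hi:  # subtle bug: should be `lo <= hi`
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        elif a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

# Typical mid-array lookups succeed, so a shallow test suite stays green:
# binary_search([1, 3, 5, 7], 3) -> 1
# ...but elements at the boundaries are silently missed:
# binary_search([1, 3, 5, 7], 7) -> -1
```

Nothing here is syntactically wrong; the failure only appears for boundary elements, which is exactly the sort of case a quick manual review or happy-path test tends to skip.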

The 210 bugs were distributed across 42 test projects:

 ID  Project
  1  distributed microservices platform
  2  event-driven simulation engine
  3  containerized development environment manager
  4  natural language processing toolkit
  5  predictive anomaly detection system
  6  decentralized voting platform
  7  smart contract development framework
  8  custom peer-to-peer network protocol
  9  real-time collaboration platform
 10  progressive web app framework
 11  webassembly compiler and runtime
 12  serverless orchestration platform
 13  procedural world generation engine
 14  ai-powered game testing framework
 15  multiplayer game networking engine
 16  big data processing framework
 17  real-time data visualization platform
 18  machine learning model monitoring system
 19  advanced encryption toolkit
 20  penetration testing automation framework
 21  iot device management platform
 22  edge computing framework
 23  smart home automation system
 24  quantum computing simulation environment
 25  bioinformatics analysis toolkit
 26  climate modeling and simulation platform
 27  advanced code generation ai
 28  automated code refactoring tool
 29  comprehensive developer productivity suite
 30  algorithmic trading platform
 31  blockchain-based supply chain tracker
 32  personal finance management ai
 33  advanced audio processing library
 34  immersive virtual reality development framework
 35  serverless computing optimizer
 36  distributed machine learning training framework
 37  robotic process automation rpa platform
 38  adaptive learning management system
 39  interactive coding education platform
 40  language learning ai tutor
 41  comprehensive personal assistant framework
 42  multiplayer collaboration platform

Results

Overall Performance

Overall, Anthropic Sonnet 3.5 significantly outperformed OpenAI o1-mini:

  • Anthropic Sonnet 3.5: Identified 26 out of 210 bugs.
  • OpenAI o1-mini: Identified 11 out of 210 bugs.

This substantial difference highlights the advantage of Sonnet 3.5’s built-in reasoning capabilities.

Language-Specific Breakdown

Detailed results across programming languages provided additional insights:

  • Python:
    • Anthropic Sonnet 3.5: 3/42 bugs detected
    • OpenAI o1-mini: 2/42 bugs detected (both struggled; slight edge to Sonnet 3.5)
  • TypeScript:
    • Anthropic Sonnet 3.5: 5/42 bugs detected
    • OpenAI o1-mini: 1/42 bugs detected (clear advantage for Sonnet 3.5)
  • Go:
    • Anthropic Sonnet 3.5: 8/42 bugs detected
    • OpenAI o1-mini: 2/42 bugs detected (strongest showing for Sonnet 3.5)
  • Rust:
    • Anthropic Sonnet 3.5: 3/41 bugs detected
    • OpenAI o1-mini: 2/41 bugs detected (close; slight edge to Sonnet 3.5)
  • Ruby:
    • Anthropic Sonnet 3.5: 7/42 bugs detected
    • OpenAI o1-mini: 4/42 bugs detected (Sonnet 3.5 clearly ahead)

Insights and Analysis

Anthropic’s Sonnet 3.5 clearly demonstrated superior overall performance, particularly in Ruby, TypeScript, and Go. This suggests that its embedded reasoning or planning phase provides meaningful advantages, especially in languages with limited representation in traditional training datasets.

Reasoning models like Sonnet 3.5 explicitly analyze code logic, enabling them to identify subtle logical inconsistencies or edge-case vulnerabilities. This approach is especially beneficial in less common languages or scenarios where traditional pattern recognition alone falls short.

Conversely, OpenAI’s o1-mini, which leans more heavily on pattern matching, came closest to parity in well-documented languages such as Python (2/42 versus Sonnet 3.5’s 3/42), suggesting that pattern-based heuristics go further in widely used, syntax-driven contexts.

Highlighted Bug Example: Logical Vulnerability in CryptoUtil (Test #7)

An illustrative example highlighting Anthropic Sonnet 3.5’s advantage involved a critical logical vulnerability in the CryptoUtil.unblind() method, where XOR operations incorrectly assumed equal lengths for the blinded signature and blinding factor:

  • Anthropic Sonnet 3.5’s Analysis:
    "The critical issue in CryptoUtil.unblind() arises from assuming equal lengths of the blinded signature and the blinding factor during an XOR operation. This incorrect assumption creates a logical vulnerability potentially exploitable in cryptographic contexts."

OpenAI o1-mini missed this significant flaw entirely, while Anthropic Sonnet 3.5 identified it through careful logical analysis, clearly demonstrating its strength in reasoning about potential security and logic flaws.
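The post does not include the CryptoUtil source itself; a minimal Python sketch of the described flaw, with names and signatures assumed for illustration, might look like this:

```python
def unblind(blinded_signature: bytes, blinding_factor: bytes) -> bytes:
    # Flawed version: zip() silently truncates to the shorter input,
    # so a length mismatch yields a short, wrong "signature" instead
    # of failing loudly. This is the logical vulnerability described
    # in Sonnet 3.5's analysis above.
    return bytes(b ^ f for b, f in zip(blinded_signature, blinding_factor))

def unblind_checked(blinded_signature: bytes, blinding_factor: bytes) -> bytes:
    # Defensive version: reject mismatched lengths before XORing.
    if len(blinded_signature) != len(blinding_factor):
        raise ValueError("signature/blinding factor length mismatch")
    return bytes(b ^ f for b, f in zip(blinded_signature, blinding_factor))
```

The flaw never raises an exception, which is what makes it hard to catch by pattern matching alone: spotting it requires reasoning about what the lengths of the two inputs can be at the call site.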

Final Thoughts

This evaluation underscores the value of reasoning-enhanced models such as Anthropic’s Sonnet 3.5 in detecting complex software bugs. While both models exhibit strengths and limitations, Sonnet 3.5’s deeper logical reasoning provides a compelling advantage, particularly in nuanced scenarios.

As AI-driven code review continues evolving, models equipped with advanced reasoning capabilities like Anthropic Sonnet 3.5 are poised to significantly improve software reliability and developer productivity, becoming indispensable tools for future software verification tasks.

