OpenAI o1 vs Anthropic Sonnet 3.7: Which AI Catches Hard Bugs Better?

Written by Everett Butler

April 25, 2025

As code complexity continues to rise, developers face increasing difficulty detecting subtle, logic-based software bugs. At Greptile, we build AI code review tools to catch these elusive errors, which often slip past standard linters and human reviewers.

Recently, I conducted an evaluation comparing two leading language models—Anthropic Sonnet 3.7 and OpenAI o1—to assess their effectiveness at detecting challenging software bugs. This blog shares the results and insights from that evaluation, exploring what their differing performance indicates for AI-assisted debugging.

Evaluation Setup

To thoroughly test each model, I introduced 210 realistic, subtle bugs into software programs written in five widely-used programming languages:

  • Go
  • Python
  • TypeScript
  • Rust
  • Ruby

These bugs were intentionally designed to mimic subtle mistakes developers commonly make, often slipping past typical code reviews and automated tools.
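
To give a sense of what these seeded defects look like, here is a hypothetical Go sketch of the kind of mistake involved. It is illustrative only and not drawn from the actual test programs: a one-character slip in a bounds check that lets an out-of-range index through.

    package shards

    import "fmt"

    // Shard is a placeholder type for this illustration.
    type Shard struct {
        ID int
    }

    // lookupShard returns the shard responsible for the given index.
    func lookupShard(shards []Shard, idx int) (Shard, error) {
        // Seeded bug: the guard should read idx >= len(shards).
        // With ">", idx == len(shards) passes the check and the
        // access below panics with an index-out-of-range error.
        if idx < 0 || idx > len(shards) {
            return Shard{}, fmt.Errorf("shard index %d out of range", idx)
        }
        return shards[idx], nil
    }

Mistakes like this compile cleanly and pass a casual review; they only surface when a caller hits the exact boundary value.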

The 210 bugs were seeded across the following 42 test programs:

ID  Program
1   distributed microservices platform
2   event-driven simulation engine
3   containerized development environment manager
4   natural language processing toolkit
5   predictive anomaly detection system
6   decentralized voting platform
7   smart contract development framework
8   custom peer-to-peer network protocol
9   real-time collaboration platform
10  progressive web app framework
11  webassembly compiler and runtime
12  serverless orchestration platform
13  procedural world generation engine
14  ai-powered game testing framework
15  multiplayer game networking engine
16  big data processing framework
17  real-time data visualization platform
18  machine learning model monitoring system
19  advanced encryption toolkit
20  penetration testing automation framework
21  iot device management platform
22  edge computing framework
23  smart home automation system
24  quantum computing simulation environment
25  bioinformatics analysis toolkit
26  climate modeling and simulation platform
27  advanced code generation ai
28  automated code refactoring tool
29  comprehensive developer productivity suite
30  algorithmic trading platform
31  blockchain-based supply chain tracker
32  personal finance management ai
33  advanced audio processing library
34  immersive virtual reality development framework
35  serverless computing optimizer
36  distributed machine learning training framework
37  robotic process automation rpa platform
38  adaptive learning management system
39  interactive coding education platform
40  language learning ai tutor
41  comprehensive personal assistant framework
42  multiplayer collaboration platform

Results

Overall Performance

Across all tests, Anthropic Sonnet 3.7 significantly outperformed OpenAI o1:

  • Anthropic Sonnet 3.7: Detected 32 out of 210 bugs
  • OpenAI o1: Detected 15 out of 210 bugs

Sonnet 3.7 caught more than twice as many bugs overall, a gap that points to a real advantage from its stronger reasoning capability.

Detailed Results by Language

Here's the breakdown of performance across languages:

  • Go:

    • Anthropic Sonnet 3.7: 6/42 bugs
    • OpenAI o1: 2/42 bugs (Sonnet performed notably better)
  • Python:

    • Anthropic Sonnet 3.7: 4/42 bugs
    • OpenAI o1: 2/42 bugs (Sonnet found twice as many)
  • TypeScript:

    • Anthropic Sonnet 3.7: 9/42 bugs
    • OpenAI o1: 4/42 bugs (Significant advantage for Sonnet)
  • Rust:

    • Anthropic Sonnet 3.7: 6/41 bugs
    • OpenAI o1: 3/41 bugs (Sonnet detected double the bugs)
  • Ruby:

    • Anthropic Sonnet 3.7: 7/42 bugs
    • OpenAI o1: 4/42 bugs (Clear advantage for Sonnet)

Why Did Anthropic Sonnet 3.7 Perform Better?

Anthropic Sonnet 3.7's consistently stronger results, especially in Go and TypeScript, likely stem from its reasoning-oriented design: it performs an explicit planning or "thinking" step before responding, which helps it trace complex logic and concurrency behavior instead of leaning on surface-level pattern matching alone.

That capability seems to matter most where pattern matching against training data has the least to offer; in Go, for instance, Sonnet found three times as many bugs as o1 (6 versus 2).

OpenAI o1 fared best in TypeScript and Ruby, where it found 4 of 42 bugs in each. Even in those languages, though, Sonnet 3.7's reasoning-driven approach kept a clear lead.

Highlighting a Notable Bug: Race Condition in Go

An illustrative example highlighting Sonnet 3.7's reasoning capability involves a subtle concurrency bug identified in a Go-based smart home notification system:

  • Test #2 (Go)
    • Anthropic Sonnet 3.7’s Explanation:
      "The critical issue arises in the NotifyDeviceUpdate method of ApiServer. There's no locking mechanism around device state updates before broadcasting, creating potential race conditions where clients may receive stale or partially updated device states."

OpenAI o1 did not detect this concurrency issue, whereas Sonnet 3.7 accurately identified the lack of synchronization. This demonstrates the value of reasoning-based AI models when dealing with nuanced concurrency logic, which can evade traditional pattern-matching approaches.
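
To make that failure mode concrete, here is a minimal Go sketch of the pattern described in the quoted explanation. The ApiServer and NotifyDeviceUpdate names come from the finding above; the device fields and client bookkeeping are assumptions added for illustration, not the actual test code.

    package server

    import "sync"

    // DeviceState is an illustrative stand-in for a smart home device's state.
    type DeviceState struct {
        ID    string
        Power bool
        Level int
    }

    // ApiServer broadcasts device updates to connected clients.
    type ApiServer struct {
        mu      sync.Mutex // exists, but is never taken below
        devices map[string]DeviceState
        clients []chan DeviceState
    }

    // NotifyDeviceUpdate stores the new state and pushes it to every client.
    // Bug: neither the map write nor the read used for the broadcast is
    // protected by s.mu, so two concurrent calls race on s.devices and
    // clients may receive stale or partially updated states.
    func (s *ApiServer) NotifyDeviceUpdate(state DeviceState) {
        s.devices[state.ID] = state // unsynchronized map write (data race)
        for _, ch := range s.clients {
            ch <- s.devices[state.ID] // unsynchronized read for the broadcast
        }
    }

The straightforward fix is to take s.mu around the map write, copy the updated state while the lock is held, and send that copy to clients, so no reader ever observes a half-applied update.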

Final Thoughts

This evaluation demonstrates that while both Anthropic Sonnet 3.7 and OpenAI o1 bring value to automated debugging, Sonnet 3.7’s reasoning capabilities clearly provide an edge—particularly for complex, logic-intensive bug detection scenarios. As AI continues to evolve, reasoning-enhanced models promise to become essential tools, significantly improving the accuracy and effectiveness of AI-assisted software verification.

[ TRY GREPTILE FREE TODAY ]

AI code reviewer that understands your codebase

Merge 50-80% faster, catch up to 3X more bugs.

14 days free • No credit card required