Effective bug detection in software development is critical, and the role of AI-powered tools has never been more important. At Greptile, we leverage AI-driven code reviews to uncover subtle yet serious bugs that traditional approaches can overlook.
In this blog post, I compare two advanced AI language models: OpenAI o1-mini and Anthropic Sonnet 3.5, evaluating their capabilities in identifying hard-to-detect software bugs. Unlike code generation, bug detection requires deep logical reasoning in addition to robust pattern recognition—making this comparison particularly insightful.
Evaluation Setup
To thoroughly assess each model, I introduced 210 challenging, realistic bugs distributed evenly across five popular programming languages:
- Python
- TypeScript
- Go
- Rust
- Ruby
Each bug was carefully chosen to reflect subtle errors that experienced developers might unintentionally introduce, often slipping through standard automated tests, linters, and manual code reviews.
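To give a flavor of the difficulty level, here is a hypothetical illustration (not taken from the benchmark itself) of the kind of bug these tests target: the arguments are swapped at the call site, which no linter or type checker will flag because both parameters share the same type.

```python
def apply_discount(price: float, discount_rate: float) -> float:
    """Return the price after applying a fractional discount."""
    return price * (1.0 - discount_rate)

# Subtle bug: the arguments are reversed at the call site. A $100 item with
# a 10% discount comes out as 0.10 * (1.0 - 100.0), roughly -9.9, instead of
# the intended 90.0, yet the code type-checks and lints cleanly.
total = apply_discount(0.10, 100.0)
```

Bugs like this produce plausible-looking code and valid output types, so catching them requires reasoning about intent rather than matching surface patterns.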
Results
Overall Performance
Overall, Anthropic Sonnet 3.5 significantly outperformed OpenAI o1-mini:
- Anthropic Sonnet 3.5: Identified 26 out of 210 bugs.
- OpenAI o1-mini: Identified 11 out of 210 bugs.
This is a substantial relative difference, even though both models caught only a small fraction of the 210 planted bugs, and it highlights the advantage of Sonnet 3.5’s built-in reasoning capabilities.
Language-Specific Breakdown
Detailed results across programming languages provided additional insights:
- Python:
  - Anthropic Sonnet 3.5: 3/42 bugs detected
  - OpenAI o1-mini: 2/42 bugs detected (both struggled, slight advantage Sonnet 3.5)
- TypeScript:
  - Anthropic Sonnet 3.5: 5/42 bugs detected
  - OpenAI o1-mini: 1/42 bugs detected (clear advantage Sonnet 3.5)
- Go:
  - Anthropic Sonnet 3.5: 8/42 bugs detected
  - OpenAI o1-mini: 2/42 bugs detected (strong performance by Sonnet 3.5)
- Rust:
  - Anthropic Sonnet 3.5: 3/41 bugs detected
  - OpenAI o1-mini: 2/41 bugs detected (close, slight advantage Sonnet 3.5)
- Ruby:
  - Anthropic Sonnet 3.5: 7/42 bugs detected
  - OpenAI o1-mini: 4/42 bugs detected (Sonnet 3.5 significantly better)
Insights and Analysis
Anthropic’s Sonnet 3.5 clearly demonstrated superior overall performance, particularly in Ruby, TypeScript, and Go. This suggests that its embedded reasoning or planning phase provides meaningful advantages, especially in languages with limited representation in traditional training datasets.
Reasoning models like Sonnet 3.5 explicitly analyze code logic, enabling them to identify subtle logical inconsistencies or edge-case vulnerabilities. This approach is especially beneficial in less common languages or scenarios where traditional pattern recognition alone falls short.
Conversely, OpenAI’s o1-mini, which relies more heavily on pattern matching, came closest to Sonnet 3.5 in Python, a language with abundant training data, suggesting that pattern-based heuristics remain serviceable in widely used, well-documented contexts.
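As a hypothetical contrast to the logic bug discussed below, here is the kind of well-known Python pitfall that pattern matching alone tends to catch, simply because it appears so often in training data:

```python
def append_item(item, items=[]):  # classic pitfall: mutable default argument
    items.append(item)
    return items

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]: the same default list persists across calls
```

A heavily documented bug like this is recognizable from its shape alone; the bugs where reasoning models pull ahead are the ones with no such telltale signature.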
Highlighted Bug Example: Logical Vulnerability in CryptoUtil (Test #7)
An illustrative example of Anthropic Sonnet 3.5’s advantage involved a critical logical vulnerability in the CryptoUtil.unblind() method, where XOR operations incorrectly assumed equal lengths for the blinded signature and the blinding factor:
- Anthropic Sonnet 3.5’s analysis: "The critical issue in CryptoUtil.unblind() arises from assuming equal lengths of the blinded signature and the blinding factor during an XOR operation. This incorrect assumption creates a logical vulnerability potentially exploitable in cryptographic contexts."
OpenAI o1-mini missed this significant flaw entirely, while Anthropic Sonnet 3.5 identified it through careful logical analysis, clearly demonstrating its strength in reasoning about potential security and logic flaws.
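The post doesn’t include the test’s source, but a minimal Python sketch of the flaw as described might look like the following. Only the CryptoUtil.unblind() name comes from the test; the signature and everything else here are assumptions for illustration.

```python
class CryptoUtil:
    @staticmethod
    def unblind(blinded_sig: bytes, blinding_factor: bytes) -> bytes:
        # BUG: zip() silently truncates to the shorter input, so a length
        # mismatch between the blinded signature and the blinding factor
        # yields a corrupted result instead of being rejected.
        return bytes(a ^ b for a, b in zip(blinded_sig, blinding_factor))
```

A defensive version would check that the two inputs have equal length and raise an error before XORing; noticing that the unchecked assumption exists at all is precisely the kind of finding that requires reasoning about the code’s logic rather than its surface patterns.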
Final Thoughts
This evaluation underscores the value of reasoning-enhanced models such as Anthropic’s Sonnet 3.5 in detecting complex software bugs. While both models exhibit strengths and limitations, Sonnet 3.5’s deeper logical reasoning provides a compelling advantage, particularly in nuanced scenarios.
As AI-driven code review continues to evolve, models equipped with advanced reasoning capabilities like Anthropic Sonnet 3.5 are poised to meaningfully improve software reliability and developer productivity, becoming indispensable tools for software verification.