Introduction
Artificial intelligence continues to play an increasingly important role in software development, particularly in automated bug detection. Traditional debugging methods can be time-consuming and often miss subtle, complex issues. To explore AI’s capabilities further, this article compares two advanced AI models—OpenAI's 4o-mini and Anthropic's Sonnet 3.5—evaluating their effectiveness in identifying challenging bugs across Python, TypeScript, Go, Rust, and Ruby. Let's dive into the findings and insights from this comparison.
The Evaluation Dataset
I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I cycled through the programs and introduced a tiny bug into each one. Every bug I introduced had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- An undefined `response` variable referenced in an `ensure` block (see the Ruby sketch after this list)
- Failing to account for amplitude normalization when stretching a sound sample
- A hard-coded date that would be accurate in most, but not all, situations
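To make the flavor of these bugs concrete, here is a minimal Ruby sketch of the first example. It is illustrative only, not one of the actual benchmark programs, and `fetch_status` is a hypothetical helper:

```ruby
require "net/http"
require "json"

# Hypothetical fetcher (not one of the actual benchmark programs).
# If Net::HTTP.get_response raises, say on a timeout, `response` is
# still nil when the ensure block runs, so `response.code` raises
# NoMethodError inside `ensure` and masks the original exception.
def fetch_status(url)
  response = Net::HTTP.get_response(URI(url))
  JSON.parse(response.body)
ensure
  puts "GET #{url} -> #{response.code}" # bug: `response` may be nil here
end
```

The fix is a one-character change (`response&.code`), and the bug only fires on the error path, which is exactly why linters and happy-path tests tend to miss it.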
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these are the hardest-to-catch bugs I could think of, and they are not representative of the median bug found in everyday software.
Results
Overall Performance
- Anthropic Sonnet 3.5 detected 26 of the 210 bugs (roughly 12%).
- OpenAI 4o-mini detected 19 (roughly 9%).
These low absolute numbers underline the difficulty of the task, but also the potential AI holds for enhancing software verification practices.
Performance by Programming Language
The results varied significantly across languages:
Go:
- Anthropic Sonnet 3.5: 8 of 42 bugs detected.
- OpenAI 4o-mini: 3 of 42 bugs detected.
- Insight: Sonnet 3.5's superior performance here suggests an advantage in logical reasoning, which is especially valuable in a concurrency-heavy language like Go.
Python:
- Anthropic Sonnet 3.5: 3 of 42 bugs detected.
- OpenAI 4o-mini: 4 of 42 bugs detected.
- Insight: OpenAI 4o-mini slightly outperformed, possibly due to its strength in pattern recognition within well-documented languages like Python.
TypeScript:
- Anthropic Sonnet 3.5: 5 of 42 bugs detected.
- OpenAI 4o-mini: 2 of 42 bugs detected.
- Insight: Sonnet 3.5's advantage suggests its deeper reasoning capability excels in strongly typed, structurally complex languages.
Rust:
- Anthropic Sonnet 3.5: 3 of 41 bugs detected.
- OpenAI 4o-mini: 4 of 41 bugs detected.
- Insight: Both models showed similar effectiveness, though OpenAI 4o-mini had a slight edge, possibly benefiting from Rust's clearly defined patterns.
Ruby:
- Anthropic Sonnet 3.5: 7 of 42 bugs detected.
- OpenAI 4o-mini: 6 of 42 bugs detected.
- Insight: Sonnet 3.5 showed notable strength, demonstrating its capacity for logical inference in a dynamically typed language.
Analysis and Key Insights
Anthropic Sonnet 3.5 generally outperformed OpenAI 4o-mini, particularly in languages with fewer standardized patterns or less abundant training data. This success can be attributed to Sonnet 3.5’s architectural emphasis on a reasoning phase before generating outputs, allowing it to interpret and logically deduce code behavior more effectively.
Conversely, OpenAI 4o-mini’s stronger performance in languages like Python and Rust highlights its reliance on rapid, pattern-based recognition, which works well with extensively documented, commonly encountered coding issues.
These differences underscore a crucial insight: integrating explicit reasoning processes into AI-driven bug detection can significantly enhance model performance, especially in contexts where mere pattern recognition is insufficient.
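As a rough illustration of what an explicit reasoning step can look like in practice, here is a sketch of a reasoning-first prompt. This is not the prompt used in this evaluation, just one plausible way to make a model analyze intent before committing to a verdict:

```ruby
# Hypothetical prompt template (not the one used in this evaluation):
# it asks for intent analysis and a logic walkthrough before a verdict,
# instead of asking for the bug directly.
BUG_HUNT_PROMPT = <<~PROMPT
  The program below contains exactly one subtle bug.
  1. First, state what each function is intended to do.
  2. Then walk through the logic step by step and flag any place
     where the behavior deviates from that intent.
  3. Only after that, name the single most likely bug.

  %{code}
PROMPT

# Usage: BUG_HUNT_PROMPT % { code: File.read("program.rb") }
```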
Highlighted Bug Example
A telling example comes from a Ruby audio-processing library: a subtle logic error in gain calculation that only Anthropic Sonnet 3.5 identified:
Test Case: Ruby Bug #1 (Gain Calculation Error)
- Sonnet 3.5 Reasoning Output: "The bug in this file is in the `TimeStretchProcessor` class, specifically how it calculates `normalize_gain`. It incorrectly uses a fixed formula without considering the `stretch_factor`. This oversight causes audio outputs to have incorrect amplitude levels."
By logically reasoning through the relationship between the `stretch_factor` and gain adjustments, Sonnet 3.5 correctly identified this inconsistency.
This specific example emphasizes how Sonnet 3.5’s reasoning capability allows it to identify logical errors beyond simple syntactic or pattern-based checks, providing a deeper level of bug detection.
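To make the failure mode concrete, here is a hypothetical reconstruction of the bug's shape. The class and method names come from the model's output above; everything else is assumed, since the benchmark program itself is not reproduced in this article:

```ruby
# Hypothetical reconstruction: the actual benchmark program is not
# published here, so everything beyond the class and method names
# reported by the model is assumed.
class TimeStretchProcessor
  def initialize(stretch_factor)
    @stretch_factor = stretch_factor
  end

  # Buggy version: gain is a fixed constant, computed without the
  # stretch factor, so stretched audio has the wrong amplitude.
  def normalize_gain(samples)
    gain = 0.5 # bug: ignores @stretch_factor
    samples.map { |s| s * gain }
  end

  # Plausible fix: scale the gain by the stretch factor so amplitude
  # stays consistent as the sample is stretched.
  def normalize_gain_fixed(samples)
    gain = 0.5 * @stretch_factor
    samples.map { |s| s * gain }
  end
end
```

Nothing in the buggy version is syntactically wrong, which is why it slips past linters and casual review; only reasoning about what the gain should depend on exposes it.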
Conclusion
The comparative analysis illustrates the strengths and weaknesses of each model, highlighting Anthropic Sonnet 3.5’s impressive reasoning-based bug detection capabilities, especially valuable in less mainstream programming languages. As AI-driven code analysis evolves, integrating reasoning steps within traditional pattern-based architectures could significantly advance software verification practices, enhancing both reliability and developer productivity.
Want to improve your software quality using advanced AI-driven bug detection? Try Greptile today.