Introduction
Identifying subtle, complex bugs remains a persistent challenge in software development. AI-powered code review tools have recently emerged as promising solutions, potentially revolutionizing how developers detect and resolve tricky software errors. In this post, I'll compare two leading AI models—OpenAI 4o and Anthropic Sonnet 3.5—to determine which performs better at detecting hard-to-find software bugs across Python, Go, TypeScript, Rust, and Ruby.
The Evaluation Dataset
I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each domain, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I cycled through the programs and introduced a tiny bug into each one. The type of bug I chose to introduce had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- Undefined `response` variable in the ensure block (see the sketch after this list)
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- Hard coded date which would be accurate in most, but not all situations
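To make the first example concrete, here is a minimal, hypothetical Ruby sketch of that kind of bug; the method and class names are illustrative and not taken from the actual dataset:

```ruby
# Hypothetical sketch: `response` is assigned inside the block passed to
# with_connection, so it is local to that block and does not exist in the
# method-level ensure clause. When the ensure clause runs, referencing
# `response` raises NameError instead of closing the connection cleanly.
def fetch_status(client)
  client.with_connection do |conn|
    response = conn.get("/status")
    return response.body
  end
ensure
  response.close if response # BUG: `response` is not in scope here
end
```

Because `response` could just as well resolve to a method at runtime, static analysis rarely flags a reference like this, and tests that never exercise the cleanup path will not catch it either.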
At the end of this, I had 210 programs (42 per language), each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.
Results
Overall, Anthropic Sonnet 3.5 outperformed OpenAI 4o, identifying 26 of the 210 bugs compared to 20 for OpenAI 4o.
Performance by Language
- Go: Anthropic Sonnet 3.5 detected twice as many bugs as OpenAI 4o (8 vs. 4 out of 42). Its reasoning capability likely helped it identify subtle concurrency and synchronization issues in Go.
- Python: OpenAI 4o performed better, catching 6 bugs compared to Sonnet 3.5's 3. Python's extensive training data and familiar patterns likely benefited OpenAI's pattern-matching approach.
- TypeScript: Performance was similar, with Anthropic Sonnet 3.5 finding 5 bugs, narrowly outperforming OpenAI 4o, which found 4. This reflects comparable pattern recognition and reasoning capabilities in strongly typed languages.
- Rust: Both models performed equally, detecting 3 out of 42 bugs. Rust's structured and safety-oriented codebases may suit pattern-based and reasoning approaches equally well.
- Ruby: Anthropic Sonnet 3.5 significantly outperformed OpenAI 4o, identifying 7 bugs compared to just 3 by OpenAI 4o. Ruby's dynamic typing and complex logic flow favored Anthropic's reasoning-focused architecture.
Analysis and Insights
The differences between OpenAI 4o and Anthropic Sonnet 3.5 underscore how varied AI architectures and training methods influence bug detection performance. Sonnet 3.5's reasoning capabilities excelled in languages with less straightforward pattern matching or less training data (like Ruby and Go), indicating that logical inference can significantly enhance bug detection in certain contexts.
Conversely, OpenAI 4o’s strength in Python emphasizes how extensive training datasets and pattern recognition are advantageous for widely used languages.
These insights suggest the future of AI bug detection tools lies in effectively combining both pattern-recognition and reasoning-based approaches, adapting strategies according to language specifics and development contexts.
Highlighted Bug Example
One particularly insightful example involved a subtle issue in a Ruby-based audio processing library, identified only by Anthropic Sonnet 3.5:
Issue Description:
The bug was found in the `TimeStretchProcessor` class, specifically in the calculation of `normalize_gain`. The original implementation mistakenly used a fixed formula rather than adjusting dynamically based on the `stretch_factor`. This caused the output audio to have incorrect amplitude: either too loud or too quiet, depending on the stretch applied.
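The exact source is not reproduced in this post, but a minimal, hypothetical Ruby sketch of the shape of the bug might look like the following; the class and method names follow the description above, while the formulas themselves are purely illustrative:

```ruby
# Hypothetical sketch of the TimeStretchProcessor issue described above.
# The gain is a fixed constant, so it ignores how much the audio was
# stretched; output ends up too loud or too quiet depending on the stretch.
class TimeStretchProcessor
  attr_reader :stretch_factor

  def initialize(stretch_factor)
    @stretch_factor = stretch_factor
  end

  def normalize_gain
    0.5 # BUG: fixed value; should be derived from stretch_factor (e.g. 1.0 / stretch_factor)
  end

  def process(samples)
    samples.map { |sample| sample * normalize_gain }
  end
end
```

A fix would compute `normalize_gain` from `stretch_factor`; noticing that the dependency is missing requires reasoning about what the gain should depend on, not just recognizing a familiar faulty pattern.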
Anthropic Sonnet 3.5 reasoned through the implications for the output audio's amplitude and correctly identified the issue. OpenAI 4o, relying more heavily on pattern recognition, missed this nuanced logical flaw.
Final Thoughts
The comparative analysis highlights the complementary strengths of pattern-based and reasoning-based AI models in automated software verification. Understanding these differences helps set clearer expectations and informs future improvements in AI-driven bug detection tools, ultimately supporting developers in producing more reliable, robust software.
Interested in improving your team's code reviews with AI? Try Greptile for free today.