Bug detection requires more than surface-level pattern recognition—it’s a reasoning problem. As LLMs are deployed in developer workflows, their ability to identify bugs before they hit production is being put to the test.
In this benchmark, we evaluated two OpenAI models, o1 and 4o-mini, on their ability to catch real-world bugs across five programming languages.
🧪 The Evaluation Dataset
I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Here are the programs I generated for the evaluation:
Next, I cycled through the programs and introduced a single tiny bug in each one. Every bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- An undefined `response` variable in an `ensure` block (a minimal sketch follows this list)
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- A hard-coded date that would be accurate in most, but not all, situations
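To make the first of these concrete, here is a minimal Ruby sketch of that class of bug. The method, the URL handling, and the log line are illustrative assumptions rather than code from the benchmark programs; only the shape of the mistake matters.

```ruby
require "net/http"
require "json"
require "uri"

# Illustrative sketch only: the method and its names are assumptions,
# not code from the benchmark programs.
def fetch_json(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri) # may raise (DNS failure, timeout, ...)
  JSON.parse(response.body)
ensure
  # Bug: if the request raises before `response` is assigned, `response`
  # is still nil here, so `response.code` raises NoMethodError and masks
  # the original exception. Happy-path tests never hit this case, and the
  # line looks perfectly reasonable in review.
  puts "Fetched #{url} (status #{response.code})"
end
```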
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the typical bugs found in everyday software.
📊 Results
Overall Bugs Detected (out of 210 programs)
- 4o-mini: 19
- o1: 15
Language Breakdown
- Python
  - o1: 2
  - 4o-mini: 4
- TypeScript
  - o1: 4
  - 4o-mini: 2
- Go
  - o1: 2
  - 4o-mini: 3
- Rust
  - o1: 3
  - 4o-mini: 4
- Ruby
  - o1: 4
  - 4o-mini: 6
4o-mini outperformed o1 in four out of five languages, with especially strong results in Ruby and Python. The only exception was TypeScript, where o1 had the upper hand.
💡 Analysis
These results suggest that 4o-mini is generally stronger when logical reasoning is required. In languages like Ruby and Rust—where LLM training data is sparser—pattern-based models like o1 tend to struggle. But 4o-mini's added reasoning phase helps it infer behavior and detect bugs that don’t follow obvious patterns.
That said, o1 performed slightly better in TypeScript, a highly structured and well-represented language in training corpora. Here, simpler pattern recognition often works well enough.
The difference boils down to this:
- o1 excels when there are clear patterns.
- 4o-mini is more robust when those patterns break down.
🐞 A Bug Worth Highlighting
Test #1 — Ruby: Incorrect Gain Scaling in Audio Library
The bug appeared in a TimeStretchProcessor class that handled audio transformation. The code used a fixed formula for normalize_gain, ignoring the stretch_factor that determines playback speed. This led to audio being too loud or too quiet depending on how much it was slowed down or sped up.
- 4o-mini detected the issue
- o1 missed it
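The post doesn't reproduce the original source, so the sketch below is only a rough guess at what the class could look like: the class name and `stretch_factor` come from the description above, while the method names, the gain formula, and the naive stretch helper are invented for illustration.

```ruby
# Rough Ruby sketch of the bug described above. The class name and
# `stretch_factor` come from the post; every other name and formula here
# is an illustrative assumption, not the actual benchmark code.
class TimeStretchProcessor
  attr_reader :stretch_factor

  def initialize(stretch_factor:)
    @stretch_factor = stretch_factor # e.g. 0.5 = faster, 2.0 = slower
  end

  # Buggy version: a fixed gain that ignores playback speed entirely.
  def normalize_gain
    1.0
  end

  # The kind of fix 4o-mini's analysis points at: scale the gain with the
  # stretch factor so amplitude stays consistent (exact formula assumed).
  def corrected_gain
    1.0 / stretch_factor
  end

  def process(samples)
    stretch(samples).map { |s| s * normalize_gain } # too loud or too quiet
  end

  private

  # Placeholder resampler so the sketch is self-contained; real
  # time-stretching is far more involved than this.
  def stretch(samples)
    out_length = (samples.length * stretch_factor).round
    Array.new(out_length) { |i| samples[(i / stretch_factor).floor] || 0.0 }
  end
end
```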
4o-mini’s analysis:
"The gain should scale relative to the stretch_factor. Using a fixed gain ignores playback speed and leads to amplitude inconsistency."
This example shows where reasoning outperforms memorization. 4o-mini connected the stretch logic with amplitude—something o1 failed to do.
✅ Final Thoughts
While both o1 and 4o-mini offer value in bug detection, 4o-mini’s reasoning ability makes it better suited for real-world reviews, especially in less conventional codebases.
- Choose 4o-mini if you care about deeper bug detection in tricky or unfamiliar code.
- Use o1 when working in high-volume, pattern-rich environments where speed matters more than nuance.
Greptile uses models like 4o-mini in production to catch concurrency issues, logic bugs, and sneaky edge cases before they ship. Want to see what it catches in your codebase? Try Greptile — no credit card required.