🧠 Introduction
We're building an AI code review tool that finds bugs and anti-patterns in pull requests. Since the quality of our reviews depends heavily on the underlying LLMs, we're constantly testing new models to see how well they detect real-world bugs.
Bug detection is not the same as code generation. It requires a deeper understanding of logic, structure, and developer intent. It's not just about pattern matching—it's about reasoning.
With the release of OpenAI’s 4o, we wanted to know: how does it compare to o1 in finding difficult bugs in code?
🧪 How We Tested
We curated a set of 210 small programs, each with a single subtle bug. The bugs were intentionally tricky—realistic enough for a professional developer to introduce, but hard to catch with linters, tests, or a quick skim.
Each program was written in one of five languages: Python, TypeScript, Go, Rust, or Ruby.
We prompted both o1 and 4o with each buggy file, then evaluated whether the model correctly identified the issue.
To build this dataset, we started with clean, working programs and then introduced a tiny bug into each one. Each bug had to be:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs we introduced:
- Undefined `response` variable in an `ensure` block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- Hard-coded date that would be accurate in most, but not all, situations (a sketch of this kind of bug follows below)
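For flavor, here is a minimal, hypothetical illustration of that last category in Python. It is not one of the 210 benchmark programs; the function and its purpose are invented purely to show the shape of a hard-coded-date bug:

```python
from datetime import date

def is_us_tax_day(d: date) -> bool:
    """Return True if `d` is the U.S. federal income tax filing deadline."""
    # BUG: Tax Day usually falls on April 15, but it shifts to the next business
    # day when the 15th lands on a weekend or a holiday, so this hard-coded
    # check is right in most years and quietly wrong in others.
    return d.month == 4 and d.day == 15
```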
At the end of this, we had 210 programs, each with a small, realistic, difficult-to-catch bug.
A disclaimer: these bugs are the hardest-to-catch bugs we could think of, and they are not representative of the median bug found in everyday software.
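The evaluation loop itself was conceptually simple. The sketch below, using the OpenAI Python SDK, shows the general shape; the model identifiers, prompt wording, directory name, and grading step are illustrative rather than our exact harness.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Review the following program. It contains exactly one subtle bug. "
    "Describe the bug and where it occurs.\n\n{code}"
)

def review(model: str, path: Path) -> str:
    """Ask `model` to find the bug in the file at `path` and return its answer."""
    response = client.chat.completions.create(
        model=model,  # e.g. "gpt-4o" or "o1"
        messages=[{"role": "user", "content": PROMPT.format(code=path.read_text())}],
    )
    return response.choices[0].message.content

# For each buggy file, collect both models' answers, then grade whether the
# known bug was correctly identified.
for path in sorted(Path("buggy_programs").glob("*")):
    answers = {model: review(model, path) for model in ("gpt-4o", "o1")}
```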
📊 What We Found
Across all 210 files, o1 correctly identified 15 bugs. 4o found 20.
That’s not a massive difference, but it’s consistent—and important. These bugs weren’t easy, and a few extra catches can mean the difference between a shipped bug and a clean PR.
Here's what stood out:
- Python: 4o outperformed o1, catching 6 bugs versus o1’s 2. This might be because Python’s dynamic nature demands more reasoning to spot non-obvious issues.
- TypeScript: Both models caught 4 bugs. The strong type system may make it easier for both models to detect surface-level issues.
- Go: 4o found twice as many bugs as o1 (4 compared to 2). Go’s concurrency model may benefit from 4o’s stronger logical reasoning.
- Rust: Both models identified 3 bugs. Rust’s strict compiler and safety checks may flatten the differences here.
- Ruby: Interestingly, o1 edged out 4o, catching 4 bugs to 4o’s 3. Sample variance could be a factor, or it might reflect differences in training data exposure.
Despite o1 being a reasoning model, 4o showed better performance overall. That suggests 4o’s architecture or training data gives it an edge, not just in pattern recognition but in logical inference too.
🕵️ A Bug Only 4o Caught
One of the most telling examples came from a small bug in a data partitioning method.
In the get_partition function, the ROUND_ROBIN strategy used random.randint(...) instead of a true round-robin algorithm. That leads to uneven and unpredictable distribution of records across partitions—a logic error, not a syntax mistake.
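We won’t reproduce the full file here, but the buggy method looked something like the sketch below. Only get_partition, the ROUND_ROBIN strategy, and the random.randint call come from the actual example; the surrounding class and names are illustrative.

```python
import random
from enum import Enum

class Strategy(Enum):
    ROUND_ROBIN = "round_robin"
    HASH = "hash"

class Partitioner:
    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._cursor = 0  # the counter a real round-robin strategy would advance

    def get_partition(self, record: dict, strategy: Strategy) -> int:
        if strategy is Strategy.ROUND_ROBIN:
            # BUG: a random partition is chosen on every call, so records are
            # spread unevenly and non-deterministically instead of cycling
            # 0, 1, 2, ... through the partitions.
            return random.randint(0, self.num_partitions - 1)
        if strategy is Strategy.HASH:
            return hash(record["key"]) % self.num_partitions
        raise ValueError(f"unsupported strategy: {strategy}")
```

A correct ROUND_ROBIN branch would return self._cursor and then advance it modulo self.num_partitions; the buggy version still runs without error, which is part of what makes it hard to catch.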
4o flagged it immediately. o1 missed it entirely.
This kind of bug requires understanding the intent of a strategy, not just its implementation. It’s a great example of why reasoning matters for AI code review.
🚀 Final Thoughts
We’re still early in the evolution of AI for software verification. The fact that any model can find bugs like these—without tests or documentation—is pretty wild.
But models like 4o are starting to push the boundaries. They’re not perfect, but they show clear signs of improvement: catching logic errors, handling subtle language features, and reasoning through non-obvious issues.
As the tooling improves, we expect AI-assisted code review to shift from “nice-to-have” to mission-critical.
And we're building for that future.
Want to see how models like 4o perform on your codebase?
👉 Try Greptile for AI-powered code review