AI models are gaining traction in code generation—but what about code review? In this post, we evaluate OpenAI’s o1-mini and DeepSeek’s R1 for their ability to detect subtle software bugs in real-world programs. The results highlight how reasoning capabilities affect performance across different programming languages.
🧪 The Evaluation Dataset
I wanted the dataset of bugs to cover multiple domains and languages. I picked sixteen domains, chose 2-3 self-contained programs for each, and used Cursor to generate each program in TypeScript, Ruby, Python, Go, and Rust.
Next, I cycled through the programs and introduced a tiny bug in each one. Each bug had to meet two criteria:
- A bug that a professional developer could reasonably introduce
- A bug that could easily slip through linters, tests, and manual code review
Some examples of bugs I introduced:
- Undefined `response` variable in an `ensure` block
- Not accounting for amplitude normalization when computing wave stretching on a sound sample
- Hard-coded date that would be accurate in most, but not all, situations (a Go sketch of this pattern appears below)
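To make the flavor of these bugs concrete, here is a minimal, hypothetical Go sketch of the hard-coded date pattern. It is not one of the actual benchmark programs; the function names are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// daysInFebruary returns the number of days in February for a given year.
// Bug: the value is hard-coded to 28, which is correct in most years
// but silently wrong in leap years (e.g. 2024).
func daysInFebruary(year int) int {
	return 28 // should be derived from the calendar instead
}

// correctDaysInFebruary shows the fix: time.Date normalizes "March 0th"
// back to the last day of February for the given year.
func correctDaysInFebruary(year int) int {
	return time.Date(year, time.March, 0, 0, 0, 0, 0, time.UTC).Day()
}

func main() {
	fmt.Println(daysInFebruary(2024))        // 28 -- wrong for a leap year
	fmt.Println(correctDaysInFebruary(2024)) // 29
}
```

Bugs like this pass most tests because the happy path is correct; they only surface under specific inputs.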
At the end of this, I had 210 programs, each with a small, difficult-to-catch, and realistic bug.
A disclaimer: these bugs are the hardest-to-catch bugs I could think of, and are not representative of the median bugs usually found in everyday software.
📊 Results
Overall Bugs Caught
- DeepSeek R1: 23
- OpenAI o1-mini: 11
DeepSeek R1 caught more than twice as many bugs as o1-mini.
By Language
- Python: o1-mini 2, DeepSeek R1 3
- TypeScript: o1-mini 1, DeepSeek R1 6
- Go: o1-mini 2, DeepSeek R1 3
- Rust: o1-mini 2, DeepSeek R1 7
- Ruby: both models 4
DeepSeek led in every language except Ruby (tied) and showed its strongest edge in TypeScript and Rust, where reasoning through asynchronous logic and error handling is crucial.
💡 Analysis
DeepSeek R1 outperformed o1-mini across the board, likely due to a stronger planning step and deeper reasoning loop before response generation.
- In TypeScript and Rust, its ability to trace execution flow and identify logic errors gives it an edge.
- In Go and Python, it provides incremental improvements, especially with concurrency or subtle semantic bugs.
- Ruby was a tie, possibly because both models have comparatively limited exposure to Ruby in training.
Meanwhile, o1-mini remains a solid baseline—fast, efficient, and competent on simple, pattern-based bugs—but struggles with deeper logic or asynchronous issues.
🐞 A Bug Worth Highlighting
Test #2 — Go: Race Condition in State Sync
A concurrency bug was planted in a Go-based smart home system. Device states were broadcast without proper locking, so clients could receive stale or partially updated data.
- DeepSeek R1 caught it
- o1-mini missed it
DeepSeek’s output:
"The bug stems from a lack of synchronization around state updates. Without locking, race conditions may result in inconsistent device states being broadcast to clients."
This required understanding of shared state access and goroutine scheduling—something models with reasoning capabilities are increasingly good at.
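For illustration, here is a minimal, hypothetical sketch of this class of bug. It is not the actual benchmark program; the `SmartHome` type and method names are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// SmartHome holds shared device state accessed from many goroutines.
type SmartHome struct {
	mu     sync.Mutex
	states map[string]string // device name -> state
}

// SetState is called concurrently as devices report updates.
func (h *SmartHome) SetState(device, state string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.states[device] = state
}

// BroadcastState snapshots the current states to send to clients.
// Bug: it reads the shared map without holding the lock, so it races with
// SetState and can observe a stale or partially updated view.
// Fix: hold h.mu (or a sync.RWMutex read lock) around the copy.
func (h *SmartHome) BroadcastState() map[string]string {
	snapshot := make(map[string]string, len(h.states))
	for device, state := range h.states { // unsynchronized read: data race
		snapshot[device] = state
	}
	return snapshot
}

func main() {
	home := &SmartHome{states: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			home.SetState(fmt.Sprintf("light-%d", n), "on")
			_ = home.BroadcastState() // `go run -race` flags the unsynchronized map read
		}(i)
	}
	wg.Wait()
}
```

Nothing about the unlocked read looks wrong in isolation, which is exactly why this kind of bug slips past linters and casual review.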
✅ Final Thoughts
This benchmark shows that DeepSeek R1 clearly outperforms OpenAI o1-mini in AI-assisted bug detection.
- Use o1-mini if you need speed and efficiency for lightweight tasks.
- Choose DeepSeek R1 if you want broader language coverage and better reasoning for tricky bugs—especially in concurrency-heavy or async-first environments.
Greptile uses models like these to surface bugs in PRs before they hit production. Want to try it on your codebase? Try Greptile — no credit card required.