AI CODE REVIEW
EVALUATION (2025)
An independent evaluation of 5 AI code review tools across 50 real-world bugs from production codebases. See which tools actually catch the issues that matter.
Overview
This is a benchmark of the five leading AI code review tools: Greptile, Cursor BugBot, Copilot, CodeRabbit, and Graphite.
Methodology
Test Repositories
To evaluate the tools fairly, I picked five open source GitHub repos, each written in a different language:
Python
TypeScript
Go
Java
Ruby
Using the commit history of each repo, I found 10 real PRs that fixed bugs, then traced each one back to the commit where the bug was originally introduced. The selected bugs reflect a range of real-world issues pushed to actual production codebases. Extremely large or single-file changes were excluded.
From there, I created two branches for each bug: one from just before the bug was introduced and one from after it was fixed. Using these, I opened fresh multi-file PRs that reintroduced the original bugs exactly as they were pushed.
Each PR was opened on five separate cloned repos, each running one of the AI code review tools. If a tool flagged the root problem, it got credit. All results were verified manually by reading the tool outputs and comparing them to the actual bug.
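To make the setup concrete, the sketch below shows one way such a PR could be reconstructed and opened with git and the GitHub CLI. The commit SHAs, branch names, and repository slug are placeholders, and the exact workflow used in the evaluation may have differed.

```python
import subprocess

def run(*args: str) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(args, check=True)

# Placeholder SHAs -- substitute the real commits traced from the repo history.
PRE_BUG_COMMIT = "<sha-before-bug>"            # parent of the bug-introducing commit
BUG_COMMIT = "<sha-that-introduced-the-bug>"   # the commit that originally pushed the bug

# Base branch: the codebase just before the bug existed.
run("git", "checkout", "-b", "eval-base", PRE_BUG_COMMIT)
run("git", "push", "-u", "origin", "eval-base")

# Head branch: reapply the original bug-introducing change on top of the base.
run("git", "checkout", "-b", "eval-bug", "eval-base")
run("git", "cherry-pick", BUG_COMMIT)
run("git", "push", "-u", "origin", "eval-bug")

# Open the PR in the clone that has one review tool installed, triggering its review.
run("gh", "pr", "create",
    "--repo", "example-org/cloned-repo",       # placeholder repository
    "--base", "eval-base",
    "--head", "eval-bug",
    "--title", "Reintroduce known bug for evaluation",
    "--body", "Multi-file PR recreating the original bug as it was pushed.")
```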
Tool Execution Details
Each tool was run using the GitHub CLI on a free trial of its cloud plan. No custom rules were configured for any of the tools. All reviews used the tool's default settings to reflect the experience a developer would get immediately upon signing up.
Reviews were triggered by opening a PR or calling the bot manually in a clean forked repo for each tool. All tools were given access to the full repository, including the PR diffs and base branches.
Test Design
Each bug was assigned a severity level—Critical, High, Medium, or Low—based on its potential impact if pushed to production:
Critical
Prevents the application from compiling or running, introduces major security vulnerabilities, or breaks core product functionality.
High
Does not crash the system but introduces serious risks—data corruption, deadlocks, broken features, or widespread access failures.
Medium
Affects performance, monitoring accuracy, or non-critical behavior. May cause confusion without breaking core functionality.
Low
Minor or cosmetic issues such as incorrect log levels, missing React keys, or inconsistent styles.
A bug was considered "caught" only if it was directly mentioned in an individual comment within the PR. Mentions in the summary alone did not count. While most bugs were isolated to a single file, they often had broader implications across the codebase. To receive credit, a tool needed to flag the faulty code directly and explain the impact of the issue through a line-level comment.
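As a concrete illustration of the scoring rule, the sketch below tallies catch rates from per-PR results. The severity enum mirrors the levels above, but the record structure and example data are hypothetical, not the evaluation's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # prevents compiling/running, major security or core breakage
    HIGH = "high"          # data corruption, deadlocks, broken features, access failures
    MEDIUM = "medium"      # performance, monitoring accuracy, non-critical behavior
    LOW = "low"            # cosmetic: log levels, missing React keys, styling

@dataclass
class ReviewResult:
    pr_id: str
    severity: Severity
    # True only if the tool left a line-level comment on the faulty code and
    # explained its impact; a mention in the PR summary alone does not count.
    caught_with_line_comment: bool

def catch_rate(results: list[ReviewResult], severity: Severity | None = None) -> float:
    """Fraction of known bugs a tool caught, optionally filtered by severity."""
    pool = [r for r in results if severity is None or r.severity == severity]
    if not pool:
        return 0.0
    return sum(r.caught_with_line_comment for r in pool) / len(pool)

# Example: two hypothetical PRs reviewed by one tool.
results = [
    ReviewResult("PR-1", Severity.CRITICAL, caught_with_line_comment=True),
    ReviewResult("PR-2", Severity.LOW, caught_with_line_comment=False),
]
print(f"{catch_rate(results):.0%}")  # 50%
```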
Limitations
This evaluation was conducted in July 2025. As these tools are rapidly evolving, their performance may change over time.
While tools often commented on additional issues within each PR, only their ability to detect the original, known bug was considered in the scoring. False positives, style suggestions, and unrelated comments were not evaluated when determining catch rate.
Performance
After running the five tools on the set of 50 PRs, the evaluation revealed differences in how each AI code review tool catches bugs, balances signal and noise, handles varying programming languages, and contributes to the overall quality of the code review process.
Bug Catch Performance
Here are the complete results from the evaluation of 50 real-world bugs across five AI code review tools:
Bug Detection Performance
Greptile significantly outperformed all other tools with an 82% catch rate, detecting 41% more bugs than the second-place tool, Cursor (58%). The results show a clear performance hierarchy: Greptile leads decisively, followed by Cursor and Copilot in the mid-50% range and CodeRabbit at 44%, while Graphite trailed far behind, catching only 6% of bugs overall.
Catch Rate by Severity
To understand how each tool performs across different types of risk, the chart below shows catch rates broken down by severity level. This highlights whether tools were more effective at catching critical issues or skewed toward lower-impact problems.
Bug Detection by Severity Level
Greptile was the top performer, catching 58.3% of critical bugs and 100% of high severity issues, with an 88.9% catch rate on medium severity bugs. Cursor followed closely, matching Greptile on critical bugs (58.3%) and scoring 64.3% on high severity. Copilot excelled on medium severity (77.8%) but lagged on high severity (57.1%). CodeRabbit delivered steady but lower performance, strongest on low severity bugs (53.3%) and weakest on critical catches (33.3%). Graphite failed to catch any high or low severity issues; its strongest showing was on critical bugs.
Overall Experience
The overall experience reflects my hands-on use of each tool. While the criteria are subjective, they were applied consistently. Noise was based on the number of comments. Comment quality measured how clear and actionable each in-line comment was. Summary usefulness captured how well the tool summarized the PR. Ease of setup reflected how smooth the installation and configuration process was. Average wait time was measured from PR open to review completion.
Tool | Noise | Comment Quality | Summary | Setup | Avg Wait |
---|---|---|---|---|---|
Greptile | High | ★★★☆ Good | ★★★★ Excellent | ★★★★ Excellent | 288s |
Copilot | High | ★★★☆ Good | ★★☆☆ Fair | ★☆☆☆ Subpar | 29s |
CodeRabbit | Moderate | ★★★☆ Good | ★★★★ Excellent | ★★★★ Excellent | 206s |
Cursor | Low | ★★☆☆ Fair | N/A | ★★☆☆ Fair | 164s |
Graphite | Low | ★★☆☆ Fair | N/A | ★★☆☆ Fair | 48s |
Greptile and CodeRabbit were the only tools to consistently deliver helpful summaries and high-quality comments, but both came with longer wait times. Copilot was fast but often lacked depth and left many unhelpful comments. Cursor made good catches but was very selective with its comments, often combining multiple issues into a single comment. Graphite was by far the least noisy, leaving only eight comments across all tests.
Findings by Repo
Performance varied noticeably across repositories, suggesting that some tools are better suited to certain languages, frameworks, or codebase structures. For each repo, you can find information on the PRs, along with whether each tool successfully identified the issue.
In the catch tables, each tool name links to its corresponding repository, where you can explore the comments, summaries, and overall output it generated for each PR.
PR Title | Bug Description | Greptile | Copilot | CodeRabbit | Cursor | Graphite |
---|---|---|---|---|---|---|
Enhanced Pagination Performance for High-Volume Audit Logs | Importing non-existent OptimizedCursorPaginator. High — Immediate runtime failure | | | | | |
Optimize spans buffer insertion with eviction during insert | Negative offset cursor manipulation bypasses pagination boundaries. Critical — Security vulnerability | | | | | |
Support upsampled error count with performance optimizations | sample_rate = 0.0 is falsy and skipped. Low — Affects test utilities only | | | | | |
GitHub OAuth Security Enhancement | Null reference if github_authenticated_user state is missing. Critical — Crashes in production | | | | | |
Replays Self-Serve Bulk Delete System | Breaking changes in error response format. Critical — Breaks existing API consumers | | | | | |
Span Buffer Multiprocess Enhancement with Health Monitoring | Inconsistent metric tagging with 'shard' and 'shards'. Medium — Hinders monitoring/debugging | | | | | |
Implement cross-system issue synchronization | Shared mutable default in dataclass timestamp. Medium — Unexpected shared state | | | | | |
Reorganize incident creation / issue occurrence logic | Using stale config variable instead of updated one. High — Uses stale configuration | | | | | |
Add ability to use queues to manage parallelism | Invalid queue.ShutDown exception handling. High — Message loss from unhandled failures | | | | | |
Add hook for producing occurrences from the stateful detector | Incomplete implementation (only contains pass). High — Missing core logic | | | | | |
Total Catches | | 8/10 | 4/10 | 3/10 | 4/10 | 0/10 |
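Two of the Python bugs in the table above follow well-known patterns: a truthiness check that treats an explicit sample_rate of 0.0 the same as a missing value, and a dataclass default that is evaluated once and then shared by every instance. The snippet below is a generic illustration of those patterns, not the project's actual code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Falsy-zero pattern: 0.0 is falsy, so this check silently discards an
# explicitly configured sample rate of zero.
def effective_rate(sample_rate: float | None, default: float = 1.0) -> float:
    if not sample_rate:          # BUG: treats 0.0 the same as None
        return default
    return sample_rate

def effective_rate_fixed(sample_rate: float | None, default: float = 1.0) -> float:
    if sample_rate is None:      # FIX: fall back only when the value is missing
        return default
    return sample_rate

# Shared-default pattern: the timestamp is evaluated once when the class is
# defined, so every instance silently shares the same value.
@dataclass
class OccurrenceBuggy:
    created_at: datetime = datetime.now(timezone.utc)        # BUG: shared default

@dataclass
class OccurrenceFixed:
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)   # FIX: fresh per instance
    )

assert effective_rate(0.0) == 1.0        # the bug: explicit zero is ignored
assert effective_rate_fixed(0.0) == 0.0  # the fix preserves it
```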