AI CODE REVIEW
EVALUATION (2025)
An evaluation of 5 AI code review tools across 50 real-world bugs from production codebases. See which tools actually catch the issues that matter.
Overview
This is a benchmark of the five leading AI code review tools: Greptile, Cursor BugBot, Copilot, CodeRabbit, and Graphite.
As AI code review tools are seeing increased adoption, engineering teams face the question: which ones actually catch the bugs that matter?
To find out, we evaluated the five most popular AI code reviewers on 50 real bugs from major open-source projects like Sentry and Cal.com.
Each tool was evaluated out of the box with no custom configuration, measuring not just catch rates but comment quality, noise, review time and setup experience. The results give a realistic picture of what these tools are capable of today, and how they perform in practice.
Methodology
Full disclosure: I'm Everett, Growth Lead at Greptile. While we're one of the tools being evaluated, I aimed to make this an objective, apples-to-apples comparison of AI code review tools. All testing was done with the same methodology across all tools, and the complete dataset is available so you can view each tool's output and compare it against the others.
Test Repositories
To assemble the dataset, I picked five open source GitHub repos, each written in a different language:
Using the commit history of each repo, I found 10 real PRs that fixed bugs, then traced each one back to the commit where the bug was originally introduced. The selected bugs reflect a range of real-world issues pushed to actual production codebases. Extremely large or single-file changes were excluded.
From there, I created two branches: one before the bug was introduced and one after it was fixed. Using these, I opened fresh multi-file PRs that reintroduced the original bugs exactly as they were pushed.
Each PR was opened on five separate cloned repos, each running one of the AI code review tools. If a tool flagged the root problem, it got credit. All results were verified manually by reading the tool outputs and comparing them to the actual bug.
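One plausible way to reconstruct such a PR is sketched below; the commit SHA and branch names are hypothetical, and the real dataset may have been assembled differently.

```python
import subprocess

def git(*args: str) -> None:
    """Run a git command, raising if it fails."""
    subprocess.run(["git", *args], check=True)

# Hypothetical SHA: in practice it comes from tracing the fix PR back
# through `git log`/`git blame` to the commit that introduced the bug.
BUG_COMMIT = "abc1234"
BASE_COMMIT = f"{BUG_COMMIT}~1"  # codebase state just before the bug existed

# Base branch: the repo as it was before the bug.
git("checkout", "-b", "eval/base", BASE_COMMIT)
git("push", "origin", "eval/base")

# Head branch: the bug-introducing change, exactly as it was pushed.
git("checkout", "-b", "eval/bug", BUG_COMMIT)
git("push", "origin", "eval/bug")

# A PR from eval/bug into eval/base reintroduces the original bug as a
# fresh multi-file diff for the review tools to examine.
```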
Tool Execution Details
Each tool was run using the GitHub CLI on a free trial of its cloud plan. No custom rules were configured for any of the tools. All reviews used the tool's default settings to reflect the experience a developer would get immediately upon signing up.
Reviews were triggered by opening a PR or calling the bot manually in a clean forked repo for each tool. All tools were given access to the full repository, including the PR diffs and base branches.
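As an illustration, opening one of these PRs through the GitHub CLI looks roughly like this (the repo slug and branch names are placeholders):

```python
import subprocess

# Placeholder fork; each tool was installed on its own clone of the repo.
REPO = "eval-org/sentry-fork"

# Opening the PR is what triggers the installed review bot.
subprocess.run(
    [
        "gh", "pr", "create",
        "--repo", REPO,
        "--base", "eval/base",
        "--head", "eval/bug",
        "--title", "Optimize spans buffer insertion with eviction during insert",
        "--body", "Reintroduces the original change for evaluation.",
    ],
    check=True,
)
```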
Test Design
Each bug was assigned a severity level—Critical, High, Medium, or Low—based on its potential impact if pushed to production:
Critical
Prevents the application from compiling or running, introduces major security vulnerabilities, or breaks core product functionality.
High
Does not crash the system but introduces serious risks—data corruption, deadlocks, broken features, or widespread access failures.
Medium
Affects performance, monitoring accuracy, or non-critical behavior. May cause confusion without breaking core functionality.
Low
Minor or cosmetic issues such as incorrect log levels, missing React keys, or inconsistent styles.
A bug was considered "caught" only if it was directly mentioned in an individual comment within the PR. Mentions in the summary alone did not count. While most bugs were isolated to a single file, they often had broader implications across the codebase. To receive credit, a tool needed to flag the faulty code directly and explain the impact of the issue through a line-level comment.
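To make that scoring rule concrete, here is a minimal sketch of how each result could be recorded and credited; the schema and field names are illustrative, not the actual dataset format.

```python
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class BugCase:
    repo: str
    pr_title: str
    severity: Severity
    buggy_file: str      # file containing the introduced bug
    buggy_lines: range   # line span of the faulty code

@dataclass
class ReviewResult:
    tool: str
    summary: str = ""
    # Line-level comments the tool left: (file path, line number, body).
    inline_comments: list[tuple[str, int, str]] = field(default_factory=list)

def caught(case: BugCase, result: ReviewResult) -> bool:
    """A bug counts as caught only if an inline comment lands on the faulty code.

    Mentions in the PR summary alone do not count; whether the comment also
    explained the impact was verified manually.
    """
    return any(
        path == case.buggy_file and line in case.buggy_lines
        for path, line, _body in result.inline_comments
    )
```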
Limitations
This evaluation was conducted in July 2025. As these tools are rapidly evolving, their performance may change over time.
While tools often commented on additional issues within each PR, only their ability to detect the original, known bug was considered in the scoring. False positives, style suggestions, and unrelated comments were not evaluated when determining catch rate.
Performance
After running the five tools on the set of 50 PRs, the evaluation revealed differences in how each AI code review tool catches bugs, balances signal and noise, handles varying programming languages, and contributes to the overall quality of the code review process.
Bug Catch Performance
Here are the complete results from our evaluation of 50 real-world bugs across five AI code review tools:
Bug Detection Performance
Greptile significantly outperformed all other tools with an 82% catch rate, detecting 41% more bugs than the second-place tool, Cursor (58%). The results show a clear performance hierarchy: Greptile leads decisively, Cursor and Copilot sit in the mid-50% range, CodeRabbit lands at 44%, and Graphite falls far behind with a 6% catch rate across all bug types.
Catch Rate by Severity
To understand how each tool performs across different types of risk, the chart below shows catch rates broken down by severity level. This highlights whether tools were more effective at catching critical issues or skewed toward lower-impact problems.
Bug Detection by Severity Level
Greptile was the top performer, catching 58.3% of critical bugs and 100% of high-severity issues, along with 88.9% of medium-severity bugs. Cursor followed closely, matching Greptile on critical bugs caught (58.3%) and scoring 64.3% on high severity. Copilot excelled on medium severity (77.8%) but lagged on high severity (57.1%). CodeRabbit delivered steady but lower performance, strongest on low-severity bugs (53.3%) and weakest on critical catches (33.3%). Graphite failed to catch any high- or low-severity issues; its strongest relative showing was on critical bugs.
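The per-severity figures are simple ratios of catches to bugs at each level; a minimal sketch of the tally, with made-up inputs, is shown below.

```python
from collections import defaultdict

# Illustrative inputs: (severity, caught?) pairs for a single tool.
results = [
    ("critical", True), ("critical", False), ("critical", False),
    ("high", True), ("high", True), ("high", False),
    ("medium", True), ("medium", True),
    ("low", False), ("low", True),
]

totals: dict[str, int] = defaultdict(int)
catches: dict[str, int] = defaultdict(int)

for severity, was_caught in results:
    totals[severity] += 1
    if was_caught:
        catches[severity] += 1

for severity in ("critical", "high", "medium", "low"):
    rate = 100 * catches[severity] / totals[severity]
    print(f"{severity:>8}: {rate:.1f}% ({catches[severity]}/{totals[severity]})")
```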
Overall Experience
The overall experience reflects my hands-on use of each tool. While the criteria are subjective, they were applied consistently. Noise was based on the number of comments. Comment quality measured how clear and actionable each in-line comment was. Summary usefulness captured how well the tool summarized the PR. Ease of setup reflected how smooth the installation and configuration process was. Average wait time was measured from PR open to review completion.
Tool | Noise | Comment Quality | Summary | Setup | Avg Wait |
---|---|---|---|---|---|
Greptile | High | ★★★☆ Good | ★★★★ Excellent | ★★★★ Excellent | 288s |
Copilot | High | ★★★☆ Good | ★★☆☆ Fair | ★☆☆☆ Subpar | 29s |
CodeRabbit | Moderate | ★★★☆ Good | ★★★★ Excellent | ★★★★ Excellent | 206s |
Cursor | Low | ★★☆☆ Fair | N/A | ★★☆☆ Fair | 164s |
Graphite | Low | ★★☆☆ Fair | N/A | ★★☆☆ Fair | 48s |
Greptile and CodeRabbit were the only tools to consistently deliver helpful summaries and high-quality comments, but both came with longer wait times. Copilot was fast but often lacked depth and left a lot of unhelpful comments. Cursor made good catches but was very selective with its comments, often combining multiple issues into a single comment. Graphite was by far the least noisy, leaving only eight comments across all tests.
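The average wait times in the table were measured from PR open to review completion. One rough way to reproduce that measurement via the GitHub REST API is sketched below; it assumes the bot's output shows up as a pull request review, which may not hold for every tool.

```python
from datetime import datetime
import requests

API = "https://api.github.com"

def review_wait_seconds(owner: str, repo: str, number: int,
                        bot_login: str, token: str) -> float | None:
    """Seconds from PR creation to the bot's first submitted review."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    pr = requests.get(f"{API}/repos/{owner}/{repo}/pulls/{number}",
                      headers=headers).json()
    reviews = requests.get(f"{API}/repos/{owner}/{repo}/pulls/{number}/reviews",
                           headers=headers).json()

    opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    submitted = [
        datetime.fromisoformat(r["submitted_at"].replace("Z", "+00:00"))
        for r in reviews
        if r.get("user", {}).get("login") == bot_login and r.get("submitted_at")
    ]
    if not submitted:
        return None  # the bot never posted a review on this PR
    return (min(submitted) - opened).total_seconds()
```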
Findings by Repo
Performance varied noticeably across repositories, suggesting that some tools are better suited to certain languages, frameworks, or codebase structures. For each repo, you can find information on the PRs, along with whether each tool successfully identified the issue.
In the tables below, each tool name links to its corresponding repository, and each checkmark or X links to the specific PR where you can explore the tool's comments, summaries, and overall output for each issue.
PR Title | Bug Description | Greptile | Copilot | CodeRabbit | Cursor | Graphite |
---|---|---|---|---|---|---|
Enhanced Pagination Performance for High-Volume Audit Logs | Importing non-existent `OptimizedCursorPaginator` (High: immediate runtime failure) | | | | | |
Optimize spans buffer insertion with eviction during insert | Negative offset cursor manipulation bypasses pagination boundaries (Critical: security vulnerability) | | | | | |
Support upsampled error count with performance optimizations | `sample_rate = 0.0` is falsy and skipped (Low: affects test utilities only) | | | | | |
GitHub OAuth Security Enhancement | Null reference if `github_authenticated_user` state is missing (Critical: crashes in production) | | | | | |
Replays Self-Serve Bulk Delete System | Breaking changes in error response format (Critical: breaks existing API consumers) | | | | | |
Span Buffer Multiprocess Enhancement with Health Monitoring | Inconsistent metric tagging with 'shard' and 'shards' (Medium: hinders monitoring/debugging) | | | | | |
Implement cross-system issue synchronization | Shared mutable default in dataclass timestamp (Medium: unexpected shared state) | | | | | |
Reorganize incident creation / issue occurrence logic | Using stale config variable instead of updated one (High: uses stale configuration) | | | | | |
Add ability to use queues to manage parallelism | Invalid `queue.ShutDown` exception handling (High: message loss from unhandled failures) | | | | | |
Add hook for producing occurrences from the stateful detector | Incomplete implementation (only contains `pass`) (High: missing core logic) | | | | | |
Total Catches | | 8/10 | 4/10 | 3/10 | 4/10 | 0/10 |
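Several of these bugs are classic language pitfalls rather than exotic logic errors. The `sample_rate = 0.0` case, for example, is the familiar Python truthiness trap; the snippet below is a simplified illustration of the pattern, not the actual Sentry code.

```python
def build_event(message: str, sample_rate: float | None = None) -> dict:
    event = {"message": message}
    # Buggy pattern: 0.0 is falsy, so an explicitly provided rate of zero
    # is silently dropped, as if no sample rate had been set at all.
    if sample_rate:
        event["sample_rate"] = sample_rate
    return event

def build_event_fixed(message: str, sample_rate: float | None = None) -> dict:
    event = {"message": message}
    # Fix: test whether a value was provided, not whether it is truthy.
    if sample_rate is not None:
        event["sample_rate"] = sample_rate
    return event

assert "sample_rate" not in build_event("error", sample_rate=0.0)        # bug
assert build_event_fixed("error", sample_rate=0.0)["sample_rate"] == 0.0  # fixed
```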
The choice of which tool to use ultimately depends on your team's priorities. If you need maximum bug detection, Greptile performs best. For teams wanting minimal noise, Cursor's and Graphite's conservative approach may appeal. Those prioritizing speed might prefer Copilot, while teams seeking in-depth PR summaries should consider CodeRabbit.
As these tools rapidly evolve, we plan to update this evaluation regularly. The complete dataset remains available for teams wanting to dig deeper into specific scenarios or run their own comparisons.