AI CODE REVIEW
EVALUATION (2025)
An independent evaluation of 5 AI code review tools across 50 real-world bugs from production codebases. See which tools actually catch the issues that matter.
Overview
This is a benchmark of the five leading AI code review tools: Greptile, Cursor BugBot, Copilot, CodeRabbit, and Graphite.
Methodology
Test Repositories
To evaluate the tools fairly, I picked five open source GitHub repos, each written in a different language:
Python
TypeScript
Go
Java
Ruby
Using the commit history of each repo, I found 10 real PRs that fixed bugs, then traced each one back to the commit where the bug was originally introduced. The selected bugs reflect a range of real-world issues pushed to actual production codebases. Extremely large or single-file changes were excluded.
From there, I created two branches for each bug: one from just before the bug was introduced and one from after it was fixed. Using these, I opened fresh multi-file PRs that reintroduced the original bugs exactly as they were pushed.
Each PR was opened on five separate cloned repos, each running one of the AI code review tools. If a tool flagged the root problem, it got credit. All results were verified manually by reading the tool outputs and comparing them to the actual bug.
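To make the setup concrete, the sketch below shows one way such a PR could be reconstructed and opened with git and the GitHub CLI. The commit SHAs, branch names, and repository slug are placeholders, and the exact workflow used in the evaluation may have differed.

```python
import subprocess

def run(*args: str) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(args, check=True)

# Placeholder SHAs -- substitute the real commits traced from the repo history.
PRE_BUG_COMMIT = "<sha-before-bug>"            # parent of the bug-introducing commit
BUG_COMMIT = "<sha-that-introduced-the-bug>"   # the commit that originally pushed the bug

# Base branch: the codebase just before the bug existed.
run("git", "checkout", "-b", "eval-base", PRE_BUG_COMMIT)
run("git", "push", "-u", "origin", "eval-base")

# Head branch: reapply the original bug-introducing change on top of the base.
run("git", "checkout", "-b", "eval-bug", "eval-base")
run("git", "cherry-pick", BUG_COMMIT)
run("git", "push", "-u", "origin", "eval-bug")

# Open the PR in the clone that has one review tool installed, triggering its review.
run("gh", "pr", "create",
    "--repo", "example-org/cloned-repo",       # placeholder repository
    "--base", "eval-base",
    "--head", "eval-bug",
    "--title", "Reintroduce known bug for evaluation",
    "--body", "Multi-file PR recreating the original bug as it was pushed.")
```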
Tool Execution Details
Each tool was run using the GitHub CLI on a free trial of its cloud plan. No custom rules were configured for any of the tools. All reviews used the tool's default settings to reflect the experience a developer would get immediately upon signing up.
Reviews were triggered by opening a PR or calling the bot manually in a clean forked repo for each tool. All tools were given access to the full repository, including the PR diffs and base branches.
Test Design
Each bug was assigned a severity level—Critical, High, Medium, or Low—based on its potential impact if pushed to production:
Critical
Prevents the application from compiling or running, introduces major security vulnerabilities, or breaks core product functionality.
High
Does not crash the system but introduces serious risks—data corruption, deadlocks, broken features, or widespread access failures.
Medium
Affects performance, monitoring accuracy, or non-critical behavior. May cause confusion without breaking core functionality.
Low
Minor or cosmetic issues such as incorrect log levels, missing React keys, or inconsistent styles.
A bug was considered "caught" only if it was directly mentioned in an individual comment within the PR. Mentions in the summary alone did not count. While most bugs were isolated to a single file, they often had broader implications across the codebase. To receive credit, a tool needed to flag the faulty code directly and explain the impact of the issue through a line-level comment.
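As a concrete illustration of the scoring rule, the sketch below tallies catch rates from per-PR results. The severity enum mirrors the levels above, but the record structure and example data are hypothetical, not the evaluation's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # prevents compiling/running, major security or core breakage
    HIGH = "high"          # data corruption, deadlocks, broken features, access failures
    MEDIUM = "medium"      # performance, monitoring accuracy, non-critical behavior
    LOW = "low"            # cosmetic: log levels, missing React keys, styling

@dataclass
class ReviewResult:
    pr_id: str
    severity: Severity
    # True only if the tool left a line-level comment on the faulty code and
    # explained its impact; a mention in the PR summary alone does not count.
    caught_with_line_comment: bool

def catch_rate(results: list[ReviewResult], severity: Severity | None = None) -> float:
    """Fraction of known bugs a tool caught, optionally filtered by severity."""
    pool = [r for r in results if severity is None or r.severity == severity]
    if not pool:
        return 0.0
    return sum(r.caught_with_line_comment for r in pool) / len(pool)

# Example: two hypothetical PRs reviewed by one tool.
results = [
    ReviewResult("PR-1", Severity.CRITICAL, caught_with_line_comment=True),
    ReviewResult("PR-2", Severity.LOW, caught_with_line_comment=False),
]
print(f"{catch_rate(results):.0%}")  # 50%
```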
Limitations
This evaluation was conducted in July 2025. As these tools are rapidly evolving, their performance may change over time.
While tools often commented on additional issues within each PR, only their ability to detect the original, known bug was considered in the scoring. False positives, style suggestions, and unrelated comments were not evaluated when determining catch rate.
Performance
After running the five tools on the set of 50 PRs, the evaluation revealed differences in how each AI code review tool catches bugs, balances signal and noise, handles varying programming languages, and contributes to the overall quality of the code review process.
Bug Catch Performance
Here are the complete results from the evaluation of 50 real-world bugs across five AI code review tools:
Bug Detection Performance
Greptile significantly outperformed all other tools with an 82% catch rate, detecting 41% more bugs than the second-place tool, Cursor (58%). The results show a clear performance hierarchy: Greptile leads decisively, followed by Cursor and Copilot in the mid-50% range and CodeRabbit at 44%, while Graphite trailed far behind, catching only 6% of bugs overall.
Catch Rate by Severity
To understand how each tool performs across different types of risk, the chart below shows catch rates broken down by severity level. This highlights whether tools were more effective at catching critical issues or skewed toward lower-impact problems.
Bug Detection by Severity Level
Greptile was the top performer, catching 58.3% of critical bugs and 100% of high severity issues, with an 88.9% catch rate on medium severity bugs. Cursor followed closely, matching Greptile on critical bugs (58.3%) and scoring 64.3% on high severity. Copilot excelled on medium severity (77.8%) but lagged on high severity (57.1%). CodeRabbit delivered steady but lower performance, strongest on low severity bugs (53.3%) and weakest on critical catches (33.3%). Graphite failed to catch any high or low severity issues; its strongest showing was on critical bugs.
Overall Experience
The overall experience reflects my hands-on use of each tool. While the criteria are subjective, they were applied consistently. Noise was based on the number of comments. Comment quality measured how clear and actionable each in-line comment was. Summary usefulness captured how well the tool summarized the PR. Ease of setup reflected how smooth the installation and configuration process was. Average wait time was measured from PR open to review completion.
Tool | Noise | Comment Quality | Summary | Setup | Avg Wait |
---|---|---|---|---|---|
Greptile | High | ★★★☆ Good | ★★★★ Excellent | ★★★★ Excellent | 288s |
Copilot | High | ★★★☆ Good | ★★☆☆ Fair | ★☆☆☆ Subpar | 29s |
CodeRabbit | Moderate | ★★★☆ Good | ★★★★ Excellent | ★★★★ Excellent | 206s |
Cursor | Low | ★★☆☆ Fair | N/A | ★★☆☆ Fair | 164s |
Graphite | Low | ★★☆☆ Fair | N/A | ★★☆☆ Fair | 48s |
Greptile and CodeRabbit were the only tools to consistently deliver helpful summaries and high-quality comments, but both came with longer wait times. Copilot was fast but often lacked depth and left many unhelpful comments. Cursor made good catches but was very selective with its comments, often combining multiple issues into a single comment. Graphite was by far the least noisy, leaving only eight comments across all tests.
Findings by Repo
Performance varied noticeably across repositories, suggesting that some tools are better suited to certain languages, frameworks, or codebase structures. For each repo, you can find information on the PRs, along with whether each tool successfully identified the issue.
In the catch tables, each tool name links to its corresponding repository, where you can explore the comments, summaries, and overall output it generated for each PR.
PR Title | Bug Description | Greptile | Copilot | CodeRabbit | Cursor | Graphite |
---|---|---|---|---|---|---|
Enhanced Pagination Performance for High-Volume Audit Logs | Importing non-existent OptimizedCursorPaginator. High — Immediate runtime failure | | | | | |
Optimize spans buffer insertion with eviction during insert | Negative offset cursor manipulation bypasses pagination boundaries. Critical — Security vulnerability | | | | | |
Support upsampled error count with performance optimizations | sample_rate = 0.0 is falsy and skipped. Low — Affects test utilities only | | | | | |
GitHub OAuth Security Enhancement | Null reference if github_authenticated_user state is missing. Critical — Crashes in production | | | | | |
Replays Self-Serve Bulk Delete System | Breaking changes in error response format. Critical — Breaks existing API consumers | | | | | |
Span Buffer Multiprocess Enhancement with Health Monitoring | Inconsistent metric tagging with 'shard' and 'shards'. Medium — Hinders monitoring/debugging | | | | | |
Implement cross-system issue synchronization | Shared mutable default in dataclass timestamp. Medium — Unexpected shared state | | | | | |
Reorganize incident creation / issue occurrence logic | Using stale config variable instead of updated one. High — Uses stale configuration | | | | | |
Add ability to use queues to manage parallelism | Invalid queue.ShutDown exception handling. High — Message loss from unhandled failures | | | | | |
Add hook for producing occurrences from the stateful detector | Incomplete implementation (only contains pass). High — Missing core logic | | | | | |
Total Catches | | 8/10 | 4/10 | 3/10 | 4/10 | 0/10 |
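Two of the Python bugs in the table above follow well-known patterns: a truthiness check that treats an explicit sample_rate of 0.0 the same as a missing value, and a dataclass default that is evaluated once and then shared by every instance. The snippet below is a generic illustration of those patterns, not the project's actual code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Falsy-zero pattern: 0.0 is falsy, so this check silently discards an
# explicitly configured sample rate of zero.
def effective_rate(sample_rate: float | None, default: float = 1.0) -> float:
    if not sample_rate:          # BUG: treats 0.0 the same as None
        return default
    return sample_rate

def effective_rate_fixed(sample_rate: float | None, default: float = 1.0) -> float:
    if sample_rate is None:      # FIX: fall back only when the value is missing
        return default
    return sample_rate

# Shared-default pattern: the timestamp is evaluated once when the class is
# defined, so every instance silently shares the same value.
@dataclass
class OccurrenceBuggy:
    created_at: datetime = datetime.now(timezone.utc)        # BUG: shared default

@dataclass
class OccurrenceFixed:
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)   # FIX: fresh per instance
    )

assert effective_rate(0.0) == 1.0        # the bug: explicit zero is ignored
assert effective_rate_fixed(0.0) == 0.0  # the fix preserves it
```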