Greptile v3, an agentic approach to code review

I'm Daksh, one of the co-founders of Greptile. At Greptile, our goal is to build agents that can autonomously validate code. Our first step toward that goal is to build a really great code reviewer for merge requests.

Recently, we rolled out v3, a complete rewrite of our code review workflow. In this post, I'm going to go into some detail about how v3 works.

Performance results

Now that v3 has reviewed over 1B lines of code since launch, we have comprehensive data to show v3 is far better than v2 along every useful axis.

Metric	v2	v3	% Change
Upvote/Downvote Ratio	1.44	5.13	+256%
Upvotes per 10K Comments	109	183	+68%
Action Rate (%)	34.75	59.24	+70.5%
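
For readers who want to check the last column, here is a minimal sketch of the arithmetic. It assumes "% Change" means the relative change from v2 to v3; the function name is just illustrative.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# Values copied from the table above: (v2, v3).
metrics = {
    "Upvote/Downvote Ratio": (1.44, 5.13),
    "Upvotes per 10K Comments": (109, 183),
    "Action Rate (%)": (34.75, 59.24),
}

for name, (v2, v3) in metrics.items():
    print(f"{name}: {percent_change(v2, v3):+.1f}%")
```

Under that assumption, the computed values agree with the table after rounding.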

Limitations of v2

To understand what makes v3 work so much better than v2, we must first understand how v2 worked.

Crudely, v2 was a flowchart, as shown in the image below. The workflow received the PR diff and metadata, ran a well-defined codebase context step, then a well-defined external context step, and finally produced review comments.

[Figure: Greptile v2 workflow flowchart showing the linear process from PR diff to review comments]

For last-gen LLMs, from GPT-4 to Claude 3.5 Sonnet, this was the most powerful way to use them: give the model a series of well-defined tasks and, for some of those tasks, provide data from an external source as needed.

There is an obvious limitation here, among others. The rigidity of the flowchart prevents the system from using new information that it gets from the search step. Here's an example to illustrate why this is a problem:

1. The system is reviewing a file that changes the login button's onClick action.
2. The system uses codebase search to find the file where the onClick function is defined.
3. It turns out the onClick function calls a function in a third file.
4. The system will never see that third file, because it has already moved on to the next step in the flowchart.
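
The limitation above can be sketched in a few lines. This is a toy model, not Greptile's v2 code: the file names and the CALL_GRAPH dictionary are hypothetical, and "codebase search" is reduced to a single dictionary lookup.

```python
from typing import Dict, List, Set

# Hypothetical toy codebase: each file maps to the files its code calls into.
CALL_GRAPH: Dict[str, List[str]] = {
    "LoginButton.tsx": ["handlers/onClick.ts"],
    "handlers/onClick.ts": ["auth/session.ts"],
    "auth/session.ts": [],
}


def fixed_pipeline_context(changed_file: str) -> Set[str]:
    """Gather context with exactly one search step, like a rigid flowchart."""
    context = {changed_file}
    # Single codebase-search hop; whatever it finds is never searched again.
    context.update(CALL_GRAPH.get(changed_file, []))
    return context


seen = fixed_pipeline_context("LoginButton.tsx")
# "auth/session.ts" (the third file) is never added to the context.
print(sorted(seen))
```

However the search step is implemented, a fixed number of hops means anything one level deeper stays invisible to the reviewer.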

A "detective" approach to code review

In v3, we introduced a new approach to code review. We let the system run in a loop, with access to some key tools such as codebase search and learned rules. The system has a very high limit on how many times it can run LLM inference or call tools, so it can keep searching the codebase recursively to follow nested function calls and do multi-hop "thinking".

[Figure: Greptile v3 agentic workflow showing the loop-based approach with codebase search and learned rules]
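
A loop of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not Greptile's implementation: a breadth-first traversal stands in for the LLM deciding what to look at next, search_codebase is a hypothetical stand-in for the codebase search tool, and the toy CALL_GRAPH and the max_tool_calls budget are invented for the example.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy codebase: each file maps to the files its code calls into.
CALL_GRAPH: Dict[str, List[str]] = {
    "LoginButton.tsx": ["handlers/onClick.ts"],
    "handlers/onClick.ts": ["auth/session.ts"],
    "auth/session.ts": [],
}


def search_codebase(file_name: str) -> List[str]:
    """Stand-in for a codebase search tool: files referenced by file_name."""
    return CALL_GRAPH.get(file_name, [])


def agentic_context(changed_file: str, max_tool_calls: int = 100) -> Set[str]:
    """Keep following references until nothing new turns up or the budget runs out."""
    context = {changed_file}
    to_explore = deque([changed_file])
    tool_calls = 0
    while to_explore and tool_calls < max_tool_calls:
        current = to_explore.popleft()
        tool_calls += 1  # each search counts against the budget
        for found in search_codebase(current):
            if found not in context:
                context.add(found)
                to_explore.append(found)  # new information feeds the next iteration
    return context


full = agentic_context("LoginButton.tsx")
one_hop = agentic_context("LoginButton.tsx", max_tool_calls=1)
print(sorted(full))
print(sorted(one_hop))
```

With a generous budget the loop reaches the third file; capping it at one tool call reproduces the single-hop behavior of a fixed flowchart.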

Let's take an example of how this works.

1. Consider a PR in which a developer updates calculateInvoiceTotal() to handle discounts differently.
2. Greptile's agent expands beyond the diff and searches the entire codebase for similar logic. It finds three related implementations, including a nested call path inside generateMonthlyStatement().
3. In that deeper chain, it spots something off: applyProration() still uses the old discount formula.
4. The agent checks git history and discovers that this helper was created during an old hotfix and never refactored.
5. Combining the pattern mismatch, the stale logic, and the historical context, Greptile raises a targeted comment: "This nested call still uses the previous discount rules; updating it will prevent inconsistent totals across invoices and statements."

The performance improvements are significant. The most obvious one is greater accuracy: because it can run long explorations of the diff and the codebase, v3 simply catches more bugs.

A second, emergent effect is higher precision, or in other words, a higher signal-to-noise ratio. Based on our study, the likely reason is a higher threshold for "sureness": v3 can challenge its own hypotheses more rigorously, so lower-confidence comments can be safely eliminated. The acceptance rate for v3, shown as Action Rate in the table above, is 70.5% higher than for v2.
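
One way to picture a "sureness" threshold is as a filter over candidate comments. The post does not say how v3 measures sureness, so everything here is an assumption for illustration: the Comment class, the numeric confidence score, and the 0.8 cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    text: str
    confidence: float  # hypothetical score in [0, 1]; not a documented Greptile field


def filter_comments(comments: List[Comment], threshold: float = 0.8) -> List[Comment]:
    """Keep only the comments whose confidence meets the threshold."""
    return [c for c in comments if c.confidence >= threshold]


candidates = [
    Comment("applyProration() still uses the old discount formula.", 0.95),
    Comment("This variable name might be confusing.", 0.40),
]
kept = filter_comments(candidates)
print([c.text for c in kept])
```

Raising the threshold trades volume for precision: fewer comments are posted, but a larger share of them should be worth acting on.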

As a side bonus, v3 uses caching far more effectively. Despite using around 3 times more context tokens than v2, it has 75% lower inference costs for our self-hosted customers, thanks to extremely high cache hit rates.
