Rise of the Overnight Agents

[ Daksh Gupta | 2026-05-05 ]



I'm Daksh, a co-founder of Greptile. We're working on automating code validation, starting with AI agents that review and test PRs. Over the last year we've reviewed several million PRs across 65,000 organizations.

In the nearly two years that Greptile has been reviewing PRs, we have seen the composition of the PRs change dramatically. In 2024, most code was hand-written with some AI autocomplete from Cursor. In 2025, agents had started to work, and programmers were letting Claude Code edit multiple files at once.

In 2026, we observed a step function increase in the adoption of fully end-to-end coding agents, which turned tickets into PRs with no human intervention. In the extreme cases, I saw reports of enthusiastic vibecoders experimenting with polyphasic sleep; their agents code 24/7 and they wake up every now and then to give them their next task.

These anecdotes intrigued me, but I was skeptical that any real work in any real codebases could be done by completely autonomous "overnight" agents.

Being a code review company, we have data on millions of pull requests, many written by agents. I decided to study the data to answer some questions:

  • Were real companies with real codebases coding this way?
  • Were the end-to-end AI-generated PRs any good?
  • In what ways did they fail?

First I had to figure out which PRs were end-to-end AI-generated

This was harder than I expected. I figured that if an agent was the "author" of the PR on GitHub, the agent likely wrote nearly all of the code. Unfortunately, fewer than 1% of PRs in my dataset listed a bot (devin-ai-integration[bot], claude[bot]) as the author on GitHub, and I suspected the real share of vibed PRs was quite a bit higher.

Thankfully, it turns out Claude and some others add a Co-Authored-By: (Claude|Devin) as a footer to the PR description if you have them open the PR. Nearly 20% of PRs reviewed by Greptile in March had these footers.

Codex doesn't leave a footer, but the Codex app names branches like codex/<task-slug>. There were 32k codex/-prefixed PRs in March alone. Cursor's background agents (the ones that open PRs autonomously, not the IDE inline-edit feature) tag branches with cursor/.

With all of these signals stacked, I had a good floor for which PRs were likely end-to-end generated by AI.
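
For concreteness, here is a minimal sketch of how those signals can be stacked into a classifier. The footer pattern and the field names are assumptions for illustration, not Greptile's actual pipeline.

```python
import re

# Illustrative sketch: stack the three signals described above to decide
# whether a PR was likely end-to-end AI-generated. The inputs (author login,
# PR description, branch name) are assumptions, not Greptile's real schema.
BOT_AUTHORS = {"devin-ai-integration[bot]", "claude[bot]"}
COAUTHOR_RE = re.compile(
    r"^Co-Authored-By:\s*(Claude|Devin)", re.IGNORECASE | re.MULTILINE
)
AGENT_BRANCH_PREFIXES = ("codex/", "cursor/")

def looks_end_to_end_ai(author: str, description: str, branch: str) -> bool:
    if author in BOT_AUTHORS:
        return True                      # agent listed as the GitHub author
    if COAUTHOR_RE.search(description or ""):
        return True                      # Claude/Devin co-author footer
    if branch.startswith(AGENT_BRANCH_PREFIXES):
        return True                      # codex/ or cursor/ branch naming
    return False
```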

Sure enough, fully AI-generated PRs have only recently become a large share of total PRs opened. Only 0.86% of PRs in Feb 2025 had evidence of being fully vibed. In April 2026, the share was 27.6%.

[ FIG. 01 / SHARE OF PRs WITH EVIDENCE OF FULL AI AUTHORSHIP ]
PRs with a bot author, a co-authored-by footer, or a codex/ branch prefix, as a monthly share of merged PRs.

month       % of merged PRs
Feb 2025    0.86%
Mar 2025    0.94%
Apr 2025    1.41%
May 2025    2.18%
Jun 2025    2.39%
Jul 2025    3.27%
Aug 2025    3.04%
Sep 2025    5.12%
Oct 2025    6.18%
Nov 2025    9.73%
Dec 2025    11.46%
Jan 2026    16.02%
Feb 2026    18.31%
Mar 2026    24.07%
Apr 2026    27.60%

Next, I had to find a way to measure PR quality.

Method 1: Using revert rates

My first heuristic was that a PR was probably not great if it had to be reverted. GitHub's auto-revert branch format is revert-<orig-pr-number>-<orig-branch>, so each revert pins back to the original PR. I looked at the window Mar 15 – Apr 14, 2026.
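
A rough sketch of that mapping, assuming you have the revert PR's branch name (the example branch and PR number below are made up):

```python
import re

# GitHub's auto-revert branches look like revert-<orig-pr-number>-<orig-branch>,
# so the original PR number can be recovered from the revert PR's branch name.
# Joining it back to the original PR's author is left to your own data model.
REVERT_RE = re.compile(r"^revert-(\d+)-(.+)$")

def reverted_pr_number(branch: str) -> int | None:
    m = REVERT_RE.match(branch)
    return int(m.group(1)) if m else None

# hypothetical example branch, not from the dataset
assert reverted_pr_number("revert-4821-codex/add-rate-limiting") == 4821
assert reverted_pr_number("feature/add-rate-limiting") is None
```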

[ FIG. 02 / REVERTS PER 1,000 MERGED PRs ]
Mar 15 – Apr 14, 2026.

author      merged     reverts / 1k
Codex       25,948     1.19
Claude      198,969    1.80
human       674,880    2.72
Cursor BG   7,335      3.41
Devin       11,131     3.50

Two of the four agents had lower revert rates than humans. That surprised me, so I started looking for an explanation.

One theory: maybe humans were delegating the easy work to agents and keeping the hard work for themselves. If a developer hands the simple PRs to an agent and writes the complicated ones themselves, the comparison isn't agent vs. human. It's agent on easy work vs. the same person on hard work.

To check this, I needed a proxy for task complexity. The number of files changed felt like a reasonable one. So I compared the median PR sizes for individual developers between PRs that did and didn't have evidence of being end-to-end AI generated.
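
In pandas terms, the comparison looks roughly like this; the dataframe columns (developer, is_ai, files_changed, loc) are assumed for illustration:

```python
import pandas as pd

# Sketch of the same-developer comparison: keep only developers with at least
# `min_each` AI PRs and `min_each` non-AI PRs, then compare median PR sizes
# within that group. Column names are assumptions, not the real schema.
def same_dev_medians(prs: pd.DataFrame, min_each: int = 3) -> pd.DataFrame:
    counts = prs.groupby(["developer", "is_ai"]).size().unstack(fill_value=0)
    eligible = counts[(counts[True] >= min_each) & (counts[False] >= min_each)].index
    subset = prs[prs["developer"].isin(eligible)]
    return subset.groupby("is_ai")[["files_changed", "loc"]].median()
```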

Same-developer PR sizes (engineers who opened ≥3 AI PRs and ≥3 non-AI PRs in April 2026):

[ FIG. 03 / MEDIAN LOC, SAME DEVELOPERS ]
Same developers, April 2026.

author                    n          median files   median LOC
AI-assisted (same devs)   205,774    4              171
human (same devs)         210,600    4              143

The opposite: the same developer's AI PRs are about 20% larger by median LOC than their non-AI PRs. So revert rates for AI-generated PRs really do look comparable to revert rates for human-written PRs.

I was curious how revert rates correlated with PR size, because I suspected larger PRs that were AI-generated had higher revert rates.

Same developers, revert rate broken out by PR size:

[ FIG. 04 / REVERT RATE BY PR SIZE, SAME DEVS ]
Same developers, April 2026.

files changed   AI reverts / 1k   human reverts / 1k   AI vs. human
1               3.36              2.99                 +12% (worse)
2–3             3.02              2.99                 tied
4–10            2.38              2.88                 −17% (better)
10+             2.57              3.57                 −28% (better)

Wrong again. For larger PRs, revert rates were actually higher for human-written PRs than agent-written PRs.

Method 2: Code churn

Reverts are rare (roughly 2–4 per 1,000 merged PRs). I wanted a denser signal. Churn is the obvious one: if you touch a file and a week later someone else touches it again, that's a form of rework.

I don't have the full diff per PR, but I know which files Greptile commented on. For that subsample (about 40% of merged PRs), I ask: does any file the PR touched get modified by a different author in a later PR within 7 days? Same-author follow-ups are usually iteration ("I'll ship it in four more PRs"), not rework.
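
A sketch of that churn check, assuming a table with one row per (PR, touched file); the column names are illustrative and the quadratic scan is only for clarity:

```python
import pandas as pd

# Sketch: for each merged PR, flag any touched file that a *different* author
# touches again within 7 days. Assumes columns pr_id, author, file, merged_at
# (datetime). Same-author follow-ups are excluded, as described above.
def churned_files(touches: pd.DataFrame, window_days: int = 7) -> pd.Series:
    touches = touches.sort_values("merged_at")
    churned = {}
    for _, row in touches.iterrows():
        later = touches[
            (touches["file"] == row["file"])
            & (touches["author"] != row["author"])   # different author only
            & (touches["merged_at"] > row["merged_at"])
            & (touches["merged_at"] <= row["merged_at"] + pd.Timedelta(days=window_days))
        ]
        churned[(row["pr_id"], row["file"])] = not later.empty
    return pd.Series(churned)
```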

Per agent, broken out by PR size, comparing each agent against the human baseline at equal complexity:

[ FIG. 05 / FILE CHURN BY AUTHOR AND PR SIZE ]
% of touched files re-edited by a different author within 7 days, April 2026. Human mean: 10.0%.

author      1 file   2–3 files   4–10 files   10+ files   mean
Codex       5.63%    6.03%       5.08%        5.90%       5.7%
Claude      8.22%    8.15%       8.26%        7.67%       8.1%
Cursor BG   10.32%   6.68%       8.47%        9.57%       8.8%
human       9.89%    9.75%       10.46%       9.81%       10.0%
Devin       13.97%   13.58%      12.94%       13.33%      13.5%

Codex sits around 5–6% rework, Claude around 8%. Human stays glued to ~10%. Devin is above the human line in every bucket, while Cursor BG hovers around it. Codex and Claude keep the same lead over humans that we saw in revert rate.

Method 3: Greptile comments on AI-generated PRs

Reverts and churn can happen for any number of reasons, some critical and others minor. I wanted more granularity.

Every PR in my dataset was reviewed by Greptile. Every Greptile comment is tagged:

  • P0 (critical, will break in prod)
  • P1 (real bug, won't necessarily page anyone)
  • P2 (style / nit / wrong-idiom).

I wanted to know if agents produced more P0s than humans.
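
The normalization itself is simple. A sketch, assuming one table of Greptile comments tagged with severity and one table of merged PRs with LOC; the column names are assumptions:

```python
import pandas as pd

# Sketch: count P0/P1/P2 flags per author class and normalize by that class's
# total merged LOC. Assumes comments has columns (author_class, severity) and
# prs has columns (author_class, loc); these names are illustrative.
def flags_per_10k_loc(comments: pd.DataFrame, prs: pd.DataFrame) -> pd.DataFrame:
    flags = comments.groupby(["author_class", "severity"]).size().unstack(fill_value=0)
    merged_loc = prs.groupby("author_class")["loc"].sum()
    return flags.div(merged_loc, axis=0) * 10_000   # flags per 10k merged LOC
```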

Issues per 10,000 merged LOC, April 2026:

[ FIG. 06 / GREPTILE FLAGS PER 10K MERGED LOC ]
April 2026. P0 = critical, breaks prod; P1 = real bug; P2 = style / nit.

author      P0 / 10k LOC   P1 / 10k LOC   P2 / 10k LOC
Devin       0.038          1.470          2.637
Codex       0.041          1.309          2.809
Claude      0.078          2.336          3.747
human       0.099          1.937          2.877
Cursor BG   0.145          2.827          4.804

Most agents actually produce fewer P0s per merged LOC than humans. For P1s and P2s, Devin and Codex still come in below the human baseline, while Claude and Cursor BG come in above it.

Method 4: How many review rounds to merge

Comments per PR is the obvious denser signal, but it scales with PR size, so it's hard to read directly. An ostensibly better metric is how many review cycles a PR goes through before it is merged: in other words, how much the agent-generated PR needed to be revised before it could be merged.

A PR that opens, gets approved, and merges = 1 cycle. A PR that gets fixes pushed and re-reviewed = 2. And so on.
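
There is more than one way to operationalize a "cycle"; here is a minimal sketch under one reasonable definition (a new cycle starts whenever a review lands after fresh commits), which is an assumption on my part rather than the exact rule used:

```python
# Sketch of cycle counting. events: chronologically ordered (kind, timestamp)
# pairs, where kind is 'commit' or 'review'. A new cycle starts each time a
# review arrives after fresh commits; approve-and-merge = 1, fixes-then-
# re-review = 2, matching the intuition above.
def review_cycles(events: list[tuple[str, object]]) -> int:
    cycles = 0
    pending_commits = True          # the PR opens with its initial commits
    for kind, _ in events:
        if kind == "commit":
            pending_commits = True
        elif kind == "review" and pending_commits:
            cycles += 1
            pending_commits = False
    return max(cycles, 1)           # a PR merged without review counts as 1
```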

Intuitively, larger PRs (by lines of code) are revised more often than smaller ones.

Average cycles to merge by PR size (cross-population, all authors, April 2026):

[ FIG. 07 / REVIEW CYCLES BY PR SIZE ]
Cross-population, all authors, April 2026.

LOC in PR   mean review cycles
< 10        1.27
10–49       1.51
50–199      1.91
200–499     2.38
500–999     2.81
1000+       3.54
[ FIG. 08 / MEAN REVIEW CYCLES BY AUTHOR ]
April 2026.

author      PRs       mean cycles   median   p90
Devin       6,159     2.11          1        4
Claude      156,219   2.19          1        4
human       433,854   2.21          1        4
Cursor BG   5,691     2.46          1        5
Codex       20,455    2.46          1        5

Every agent's review cycles per PR are fairly consistent with the human baseline.

What are the patterns of failure?

I didn't see any evidence that agent-generated PRs were any worse than human-written PRs. However, I was curious if the patterns of failure were different.

I decided to run a keyword search across Greptile comments left on PRs generated by the various agents, using keywords like "off-by-one" or "sql injection".
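
A sketch of the sweep, with a deliberately tiny keyword map; the category-to-keyword mapping and the column names here are illustrative, not the full list I used:

```python
import pandas as pd

# Sketch: for each failure category, count keyword hits in review comments,
# normalize by each author class's merged LOC, then divide by the human rate.
# CATEGORIES, author_class, body, and loc_by_author are all illustrative.
CATEGORIES = {
    "off-by-one": ["off-by-one", "off by one"],
    "sql injection": ["sql injection", "sqli"],
}

def failure_ratios(comments: pd.DataFrame, loc_by_author: pd.Series) -> pd.DataFrame:
    rows = {}
    for category, keywords in CATEGORIES.items():
        pattern = "|".join(keywords)
        hits = comments[comments["body"].str.contains(pattern, case=False, na=False)]
        rate = hits.groupby("author_class").size() / loc_by_author   # hits per LOC
        rows[category] = rate / rate["human"]                        # vs. human baseline
    return pd.DataFrame(rows).T
```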

Then, I computed how often each agent produced that type of error relative to the human baseline. I have included an assortment of the most interesting rows here:

Note
Agent rate divided by human rate, April 2026. >1 means the agent is flagged for this category more often than a human, per LOC.
[ FIG. 09 / FAILURE-PATTERN HEATMAP ]
Agent rate ÷ human rate, per LOC, April 2026. 1.0× = human rate.

category                          Claude   Codex   Devin   Cursor BG
security
  sql injection                   1.50×    1.25×   0.70×   1.70×
  xss                             1.57×    0.86×   0.86×   1.43×
  auth bypass                     1.50×    1.00×   0.50×   1.67×
  IDOR / missing tenant check     1.75×    0.88×   0.69×   1.31×
  secret in logs                  1.34×    1.34×   0.94×   1.65×
correctness
  n+1 query                       1.27×    0.64×   0.45×   3.45×
  regression / breaks existing    1.25×    1.34×   0.89×   2.37×
  off-by-one                      1.64×    0.55×   0.64×   2.27×
  timezone / date bug             1.48×    0.90×   0.66×   2.09×
  env var / config bug            1.45×    1.35×   1.35×   0.95×
housekeeping
  test missing                    0.96×    1.13×   0.93×   2.37×
  dead code                       1.14×    0.99×   0.78×   2.05×
  stale comment / wrong doc       1.69×    0.38×   0.88×   0.69×

A few things stood out per agent:

  • Cursor BG is the only column with any category above 2× the human rate. Its three highest cells are n+1 query (3.45×), "breaks existing behavior" (2.37×), and missing tests (2.37×). Off-by-one is at 2.27×.

  • Codex's above-human categories cluster around configuration and breakage: env-var / config bugs (1.35×), "breaks existing" (1.34×), and "secret in logs" (1.34×). Most other categories are at or below the human rate.

  • Claude's highest cells are IDOR / missing tenant check (1.75×), stale comment / wrong doc (1.69×), off-by-one (1.64×), and XSS (1.57×). Auth bypass is at 1.50×.

  • Devin is at or below the human rate on every category I tested except env-var / config bugs (1.35×). Its security cells (auth bypass 0.5×, IDOR 0.69×, SQL injection 0.7×) are well below human.

A note on Devin: its low rates here don't square cleanly with its high revert rate from Method 1. The simplest explanation I can offer is that the things Devin gets wrong are not the things this keyword sweep catches - for example, "completed the wrong task" wouldn't show up in any of these categories but would still get reverted. I don't have a way to confirm this from the data.

I think I was wrong

I went in expecting agent PRs to look noticeably worse. Across all four methods, they don't. Two of the four agents had lower revert rates than humans. Every agent except Cursor BG drew fewer P0 flags per 10k LOC than human PRs. Review cycles all clustered within 0.4 of the human mean.

So real codebases are letting these things ship code, and the code is roughly fine. 27.6% of merged PRs in April were end-to-end AI-generated, and that number is still climbing.

The interesting part is what each agent gets wrong. Cursor BG over-indexes on n+1s. Claude on tenancy and auth. Codex on config. The bugs don't disappear, they just move. Whatever review process you've built for human PRs probably wasn't built for those.

Appendix: shortcomings of my methods

  • Adverse selection. While PR size didn't suggest humans were keeping the bigger changes for themselves, I can't help but feel that a human choosing to hand-write a PR for a certain task rather than use an agent might say something about the riskiness of the change.
  • Line-level churn. "What fraction of the lines an agent merged this month are still in the file in three months" is the gold-standard quality signal. I can't compute it because we don't store diffs.
  • Cross-agent contamination. A Codex PR might have had a Claude Code review pass over it before merge; a Claude PR might have had Cursor inline-edits before the headless run. My classifier picks one agent per PR.
  • Agent self-review vs Greptile review. Cursor's background agents and Codex now ship with their own pre-PR review hooks. PRs that were already cleaned up by an in-agent review look better to Greptile than PRs that weren't, and I can't see from the GitHub side whether that pre-review happened.
  • The "human" code. Everyone with a keyboard in 2026 has Cursor's IDE running. It won't surprise me if the "human" PRs are nearly as AI-generated as the ones attributed to an agent.

If you have feedback on my methodology or conclusions, I'd love to hear from you. My email is daksh@greptile.com.




