Rise of the Overnight Agents

[ Daksh Gupta | 2026-05-05 ]



I'm Daksh, a co-founder of Greptile. We're working on automating code validation, starting with AI agents that review and test PRs. Over the last year we've reviewed several million PRs across 65,000 organizations.

In the nearly two years that Greptile has been reviewing PRs, we have seen the composition of the PRs change dramatically. In 2024, most code was hand-written with some AI autocomplete from Cursor. In 2025, agents had started to work, and programmers were letting Claude Code edit multiple files at once.

In 2026, we observed a step function increase in the adoption of fully end-to-end coding agents, which turned tickets into PRs with no human intervention. In the extreme cases, I saw reports of enthusiastic vibecoders experimenting with polyphasic sleep; their agents code 24/7 and they wake up every now and then to give them their next task.

These anecdotes intrigued me, but I was skeptical that any real work in any real codebases could be done by completely autonomous "overnight" agents.

Being a code review company, we have data on millions of pull requests, many written by agents. I decided to study the data to answer some questions:

  • Were real companies with real codebases coding this way?
  • Were the end-to-end AI-generated PRs any good?
  • In what ways did they fail?

First I had to figure out which PRs were end-to-end AI-generated

This was harder than I expected. I figured that if an agent was the "author" of the PR on GitHub, the agent likely wrote nearly all of the code. Unfortunately, fewer than 1% of PRs in my dataset listed a bot (devin-ai-integration[bot], claude[bot]) as the author on GitHub, and I suspected the real share of vibed PRs was quite a bit higher.

Thankfully, it turns out Claude and some others add a Co-Authored-By: (Claude|Devin) as a footer to the PR description if you have them open the PR. Nearly 20% of PRs reviewed by Greptile in March had these footers.

Codex doesn't leave a footer, but the Codex app names branches like codex/<task-slug>. There were 32k codex/-prefixed PRs in March alone. Cursor's background agents (the ones that open PRs autonomously, not the IDE inline-edit feature) tag branches with cursor/.

With all of these signals stacked, I had a good floor for which PRs were likely end-to-end generated by AI.
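
For concreteness, here is a minimal sketch of how those signals can be stacked into a classifier. The footer pattern and the field names are assumptions for illustration, not Greptile's actual pipeline.

```python
import re

# Illustrative sketch: stack the three signals described above to decide
# whether a PR was likely end-to-end AI-generated. The inputs (author login,
# PR description, branch name) are assumptions, not Greptile's real schema.
BOT_AUTHORS = {"devin-ai-integration[bot]", "claude[bot]"}
COAUTHOR_RE = re.compile(
    r"^Co-Authored-By:\s*(Claude|Devin)", re.IGNORECASE | re.MULTILINE
)
AGENT_BRANCH_PREFIXES = ("codex/", "cursor/")

def looks_end_to_end_ai(author: str, description: str, branch: str) -> bool:
    if author in BOT_AUTHORS:
        return True                      # agent listed as the GitHub author
    if COAUTHOR_RE.search(description or ""):
        return True                      # Claude/Devin co-author footer
    if branch.startswith(AGENT_BRANCH_PREFIXES):
        return True                      # codex/ or cursor/ branch naming
    return False
```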

Sure enough, fully AI-generated PRs have only recently become a large share of total PRs opened. Only 0.86% of PRs in Feb 2025 had evidence of being fully vibed. In April 2026, the share was 27.6%.

[ FIG. 01 / SHARE OF PRs WITH EVIDENCE OF FULL AI AUTHORSHIP ]
PRs with a bot author, a co-authored-by footer, or a codex/ branch prefix, as a monthly share of merged PRs.

month       % of merged PRs
Feb 2025    0.86%
Mar 2025    0.94%
Apr 2025    1.41%
May 2025    2.18%
Jun 2025    2.39%
Jul 2025    3.27%
Aug 2025    3.04%
Sep 2025    5.12%
Oct 2025    6.18%
Nov 2025    9.73%
Dec 2025    11.46%
Jan 2026    16.02%
Feb 2026    18.31%
Mar 2026    24.07%
Apr 2026    27.60%

Next, I had to find a way to measure PR quality.

Method 1: Using revert rates

My first heuristic was that a PR was probably not great if it had to be reverted. GitHub's auto-revert branch format is revert-<orig-pr-number>-<orig-branch>, so each revert pins back to the original PR. I looked at the window Mar 15 – Apr 14, 2026.
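
A rough sketch of that mapping, assuming you have the revert PR's branch name (the example branch and PR number below are made up):

```python
import re

# GitHub's auto-revert branches look like revert-<orig-pr-number>-<orig-branch>,
# so the original PR number can be recovered from the revert PR's branch name.
# Joining it back to the original PR's author is left to your own data model.
REVERT_RE = re.compile(r"^revert-(\d+)-(.+)$")

def reverted_pr_number(branch: str) -> int | None:
    m = REVERT_RE.match(branch)
    return int(m.group(1)) if m else None

# hypothetical example branch, not from the dataset
assert reverted_pr_number("revert-4821-codex/add-rate-limiting") == 4821
assert reverted_pr_number("feature/add-rate-limiting") is None
```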

[ FIG. 02 / REVERTS PER 1,000 MERGED PRs ]
Mar 15 – Apr 14, 2026.

author      merged     reverts / 1k
Codex       25,948     1.19
Claude      198,969    1.80
human       674,880    2.72
Cursor BG   7,335      3.41
Devin       11,131     3.50

Two of the four agents had lower revert rates than humans. That surprised me, so I started looking for an explanation.

One theory: maybe humans were delegating the easy work to agents and keeping the hard work for themselves. If a developer hands the simple PRs to an agent and writes the complicated ones themselves, the comparison isn't agent vs. human. It's agent on easy work vs. the same person on hard work.

To check this, I needed a proxy for task complexity. The number of files changed felt like a reasonable one. So I compared the median PR sizes for individual developers between PRs that did and didn't have evidence of being end-to-end AI generated.
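
In pandas terms, the comparison looks roughly like this; the dataframe columns (developer, is_ai, files_changed, loc) are assumed for illustration:

```python
import pandas as pd

# Sketch of the same-developer comparison: keep only developers with at least
# `min_each` AI PRs and `min_each` non-AI PRs, then compare median PR sizes
# within that group. Column names are assumptions, not the real schema.
def same_dev_medians(prs: pd.DataFrame, min_each: int = 3) -> pd.DataFrame:
    counts = prs.groupby(["developer", "is_ai"]).size().unstack(fill_value=0)
    eligible = counts[(counts[True] >= min_each) & (counts[False] >= min_each)].index
    subset = prs[prs["developer"].isin(eligible)]
    return subset.groupby("is_ai")[["files_changed", "loc"]].median()
```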

Same-developer PR sizes (engineers who opened ≥3 AI PRs and ≥3 non-AI PRs in April 2026):

[ FIG. 03 / MEDIAN LOC, SAME DEVELOPERS ]
Same developers, April 2026.

author                    n          median files   median LOC
AI-assisted (same devs)   205,774    4              171
human (same devs)         210,600    4              143

The opposite: the same developer's AI PRs are about 20% larger by median LOC than their non-AI PRs. So revert rates for AI-generated PRs really do look comparable to revert rates for human-written PRs.

I was curious how revert rates correlated with PR size, because I suspected larger PRs that were AI-generated had higher revert rates.

Same developers, revert rate broken out by PR size:

[ FIG. 04 / REVERT RATE BY PR SIZE, SAME DEVS ]
Same developers, April 2026.

files changed   AI reverts / 1k   human reverts / 1k   AI vs. human
1               3.36              2.99                 +12% (worse)
2–3             3.02              2.99                 tied
4–10            2.38              2.88                 −17% (better)
10+             2.57              3.57                 −28% (better)

Wrong again. For larger PRs, revert rates were actually higher for human-written PRs than agent-written PRs.

Method 2: Code churn

Reverts are rare (roughly 2–4 per 1,000 merged PRs). I wanted a denser signal. Churn is the obvious one: if you touch a file and a week later someone else touches it again, that's a form of rework.

I don't have the full diff per PR, but I know which files Greptile commented on. For that subsample (about 40% of merged PRs), I ask: does any file the PR touched get modified by a different author in a later PR within 7 days? Same-author follow-ups are usually iteration ("I'll ship it in four more PRs"), not rework.
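
A sketch of that churn check, assuming a table with one row per (PR, touched file); the column names are illustrative and the quadratic scan is only for clarity:

```python
import pandas as pd

# Sketch: for each merged PR, flag any touched file that a *different* author
# touches again within 7 days. Assumes columns pr_id, author, file, merged_at
# (datetime). Same-author follow-ups are excluded, as described above.
def churned_files(touches: pd.DataFrame, window_days: int = 7) -> pd.Series:
    touches = touches.sort_values("merged_at")
    churned = {}
    for _, row in touches.iterrows():
        later = touches[
            (touches["file"] == row["file"])
            & (touches["author"] != row["author"])   # different author only
            & (touches["merged_at"] > row["merged_at"])
            & (touches["merged_at"] <= row["merged_at"] + pd.Timedelta(days=window_days))
        ]
        churned[(row["pr_id"], row["file"])] = not later.empty
    return pd.Series(churned)
```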

Per agent, broken out by PR size, comparing each agent against the human baseline at equal complexity:

[ FIG. 05 / FILE CHURN BY AUTHOR AND PR SIZE ]
% of touched files re-edited by a different author within 7 days, April 2026. Human mean: 10.0%.

author      1 file   2–3 files   4–10 files   10+ files   mean
Codex       5.63%    6.03%       5.08%        5.90%       5.7%
Claude      8.22%    8.15%       8.26%        7.67%       8.1%
Cursor BG   10.32%   6.68%       8.47%        9.57%       8.8%
human       9.89%    9.75%       10.46%       9.81%       10.0%
Devin       13.97%   13.58%      12.94%       13.33%      13.5%

Codex sits around 5–6% rework, Claude around 8%. Human stays glued to ~10%. Devin is above the human line in every bucket, while Cursor BG hovers around it. Codex and Claude keep the same lead over humans that we saw in revert rate.

Method 3: Greptile comments on AI-generated PRs

Reverts and churn can happen for any number of reasons, some critical and others minor. I wanted more granularity.

Every PR in my dataset was reviewed by Greptile. Every Greptile comment is tagged:

  • P0 (critical, will break in prod)
  • P1 (real bug, won't necessarily page anyone)
  • P2 (style / nit / wrong-idiom).

I wanted to know if agents produced more P0s than humans.
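
The normalization itself is simple. A sketch, assuming one table of Greptile comments tagged with severity and one table of merged PRs with LOC; the column names are assumptions:

```python
import pandas as pd

# Sketch: count P0/P1/P2 flags per author class and normalize by that class's
# total merged LOC. Assumes comments has columns (author_class, severity) and
# prs has columns (author_class, loc); these names are illustrative.
def flags_per_10k_loc(comments: pd.DataFrame, prs: pd.DataFrame) -> pd.DataFrame:
    flags = comments.groupby(["author_class", "severity"]).size().unstack(fill_value=0)
    merged_loc = prs.groupby("author_class")["loc"].sum()
    return flags.div(merged_loc, axis=0) * 10_000   # flags per 10k merged LOC
```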

Issues per 10,000 merged LOC, April 2026:

[ FIG. 06 / GREPTILE FLAGS PER 10K MERGED LOC ]
April 2026. P0 = critical, breaks prod; P1 = real bug; P2 = style / nit.

author      P0 / 10k LOC   P1 / 10k LOC   P2 / 10k LOC
Devin       0.038          1.470          2.637
Codex       0.041          1.309          2.809
Claude      0.078          2.336          3.747
human       0.099          1.937          2.877
Cursor BG   0.145          2.827          4.804

Most agents actually produce fewer P0s per merged LOC than humans. For P1s and P2s, Devin and Codex still come in below the human baseline, while Claude and Cursor BG come in above it.

Method 4: How many review rounds to merge

Comments per PR is the obvious denser signal, but it scales with PR size, so it's hard to read directly. An ostensibly better metric is how many review cycles a PR goes through before it is merged: in other words, how much the agent-generated PR needed to be revised before it could be merged.

A PR that opens, gets approved, and merges = 1 cycle. A PR that gets fixes pushed and re-reviewed = 2. And so on.
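
There is more than one way to operationalize a "cycle"; here is a minimal sketch under one reasonable definition (a new cycle starts whenever a review lands after fresh commits), which is an assumption on my part rather than the exact rule used:

```python
# Sketch of cycle counting. events: chronologically ordered (kind, timestamp)
# pairs, where kind is 'commit' or 'review'. A new cycle starts each time a
# review arrives after fresh commits; approve-and-merge = 1, fixes-then-
# re-review = 2, matching the intuition above.
def review_cycles(events: list[tuple[str, object]]) -> int:
    cycles = 0
    pending_commits = True          # the PR opens with its initial commits
    for kind, _ in events:
        if kind == "commit":
            pending_commits = True
        elif kind == "review" and pending_commits:
            cycles += 1
            pending_commits = False
    return max(cycles, 1)           # a PR merged without review counts as 1
```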

Intuitively, larger PRs (by lines of code) are revised more often than smaller ones.

Average cycles to merge by PR size (cross-population, all authors, April 2026):

[ FIG. 07 / REVIEW CYCLES BY PR SIZE ]
Cross-population, all authors, April 2026.

LOC in PR   mean review cycles
< 10        1.27
10–49       1.51
50–199      1.91
200–499     2.38
500–999     2.81
1000+       3.54
[ FIG. 08 / MEAN REVIEW CYCLES BY AUTHOR ]
April 2026.

author      PRs       mean cycles   median   p90
Devin       6,159     2.11          1        4
Claude      156,219   2.19          1        4
human       433,854   2.21          1        4
Cursor BG   5,691     2.46          1        5
Codex       20,455    2.46          1        5

Every agent's review cycles per PR are fairly consistent with the human baseline.

What are the patterns of failure?

I didn't see any evidence that agent-generated PRs were any worse than human-written PRs. However, I was curious if the patterns of failure were different.

I decided to run a keyword search across Greptile comments left on PRs generated by the various agents, using keywords like "off-by-one" or "sql injection".
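
A sketch of the sweep, with a deliberately tiny keyword map; the category-to-keyword mapping and the column names here are illustrative, not the full list I used:

```python
import pandas as pd

# Sketch: for each failure category, count keyword hits in review comments,
# normalize by each author class's merged LOC, then divide by the human rate.
# CATEGORIES, author_class, body, and loc_by_author are all illustrative.
CATEGORIES = {
    "off-by-one": ["off-by-one", "off by one"],
    "sql injection": ["sql injection", "sqli"],
}

def failure_ratios(comments: pd.DataFrame, loc_by_author: pd.Series) -> pd.DataFrame:
    rows = {}
    for category, keywords in CATEGORIES.items():
        pattern = "|".join(keywords)
        hits = comments[comments["body"].str.contains(pattern, case=False, na=False)]
        rate = hits.groupby("author_class").size() / loc_by_author   # hits per LOC
        rows[category] = rate / rate["human"]                        # vs. human baseline
    return pd.DataFrame(rows).T
```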

Then, I computed how often each agent produced that type of error relative to the human baseline. I have included an assortment of the most interesting rows here:

Note
Agent rate divided by human rate, April 2026. >1 means the agent is flagged for this category more often than a human, per LOC.
[ FIG. 09 / FAILURE-PATTERN HEATMAP ]
Agent rate ÷ human rate, per LOC, April 2026. 1.0× = human rate.

category                          Claude   Codex   Devin   Cursor BG
security
  sql injection                   1.50×    1.25×   0.70×   1.70×
  xss                             1.57×    0.86×   0.86×   1.43×
  auth bypass                     1.50×    1.00×   0.50×   1.67×
  IDOR / missing tenant check     1.75×    0.88×   0.69×   1.31×
  secret in logs                  1.34×    1.34×   0.94×   1.65×
correctness
  n+1 query                       1.27×    0.64×   0.45×   3.45×
  regression / breaks existing    1.25×    1.34×   0.89×   2.37×
  off-by-one                      1.64×    0.55×   0.64×   2.27×
  timezone / date bug             1.48×    0.90×   0.66×   2.09×
  env var / config bug            1.45×    1.35×   1.35×   0.95×
housekeeping
  test missing                    0.96×    1.13×   0.93×   2.37×
  dead code                       1.14×    0.99×   0.78×   2.05×
  stale comment / wrong doc       1.69×    0.38×   0.88×   0.69×

A few things stood out per agent:

  • Cursor BG is the only column with any category above 2× the human rate. Its three highest cells are n+1 query (3.45×), "breaks existing behavior" (2.37×), and missing tests (2.37×). Off-by-one is at 2.27×.

  • Codex's above-human categories cluster around configuration and breakage: env-var / config bugs (1.35×), "breaks existing" (1.34×), and "secret in logs" (1.34×). Most other categories are at or below the human rate.

  • Claude's highest cells are IDOR / missing tenant check (1.75×), stale comment / wrong doc (1.69×), off-by-one (1.64×), and XSS (1.57×). Auth bypass is at 1.50×.

  • Devin is at or below the human rate on every category I tested except env-var / config bugs (1.35×). Its security cells (auth bypass 0.5×, IDOR 0.69×, SQL injection 0.7×) are well below human.

A note on Devin: its low rates here don't square cleanly with its high revert rate from Method 1. The simplest explanation I can offer is that the things Devin gets wrong are not the things this keyword sweep catches - for example, "completed the wrong task" wouldn't show up in any of these categories but would still get reverted. I don't have a way to confirm this from the data.

I think I was wrong

I went in expecting agent PRs to look noticeably worse. Across all four methods, they don't. Two of the four agents had lower revert rates than humans. Every agent except Cursor BG drew fewer P0 flags per 10k LOC than human PRs. Review cycles all clustered within 0.4 of the human mean.

So real codebases are letting these things ship code, and the code is roughly fine. 27.6% of merged PRs in April were end-to-end AI-generated, and that number is still climbing.

The interesting part is what each agent gets wrong. Cursor BG over-indexes on n+1s. Claude on tenancy and auth. Codex on config. The bugs don't disappear, they just move. Whatever review process you've built for human PRs probably wasn't built for those.

Appendix: shortcomings of my methods

  • Adverse selection. While PR size didn't suggest humans were keeping the bigger changes for themselves, I can't help but feel that a human choosing to hand-write a PR for a certain task rather than use an agent might say something about the riskiness of the change.
  • Line-level churn. "What fraction of the lines an agent merged this month are still in the file in three months" is the gold-standard quality signal. I can't compute it because we don't store diffs.
  • Cross-agent contamination. A Codex PR might have had a Claude Code review pass over it before merge; a Claude PR might have had Cursor inline-edits before the headless run. My classifier picks one agent per PR.
  • Agent self-review vs Greptile review. Cursor's background agents and Codex now ship with their own pre-PR review hooks. PRs that were already cleaned up by an in-agent review look better to Greptile than PRs that weren't, and I can't see from the GitHub side whether that pre-review happened.
  • The "human" code. Everyone with a keyboard in 2026 has Cursor's IDE running. It won't surprise me if the "human" PRs are nearly as AI-generated as the ones attributed to an agent.

If you have feedback on my methodology or conclusions, I'd love to hear from you. My email is daksh@greptile.com.




