Greptile Research

AI CODING REPORT

A cross-industry study on recent trends in AI software development.

Table of Contents


Section 01

Engineering Team Velocity

Measuring productivity gains across development workflows.

Chart 1.1
PRs Are Getting Bigger

Median PR size increased 33% from March to November 2025, rising from 57 to 76 lines changed per PR.

Chart 1.2
Developer Output Up 76%

Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.

Chart 1.3
Medium Teams Output +89%

Medium teams (6-15 devs) increased output from 7,005 to 13,227 lines per developer.

Chart 1.4
Lines Changed Per File Up 20%

Median lines changed per file grew from 18 to 22 as PRs become denser.


Section 02

AI Tool Adoption

Tracking the rise of AI-powered development tools.

Chart 2.1
AI Memory Packages

mem0 dominates with 59% market share. The clear leader in AI memory infrastructure.

PyPI + npm monthly downloads, Nov 2025

Chart 2.2
Vector DB Market Share

No clear winner. Weaviate leads at 25%, but 6 players sit between 10-25% share.

PyPI + npm monthly downloads, Nov 2025

Chart 2.3
AI Rules Files

CLAUDE.md leads adoption at 67%. Most teams use multiple formats.

17% of repos use all three formats

Chart 2.4
AI SDK Growth

Anthropic SDK leads at 43M (8x growth). Pydantic AI explodes 3.7x to 6M.

PyPI + npm monthly downloads, Apr–Nov 2025

Chart 2.5
LLMOps Top 5

LangSmith dominates at 110M. Helicone included for comparison at 5.5K.

PyPI + npm monthly downloads, Jun–Nov 2025

LangSmith is bundled with LangChain installs


Section 03

Model Growth Trends

How AI models have scaled and evolved.

Chart 3.1
LLM Provider SDK Downloads

OpenAI leads at 130M. Anthropic grew 1,547x since Apr 2023. Google trails at 13.6M.

PyPI monthly downloads, Jan 2022–Nov 2025

Chart 3.2
The Gap is Closing

OpenAI-to-Anthropic ratio dropped from 47:1 (Jan 2024) to 4.2:1 (Nov 2025).

Peak: 47:1 (Jan 2024)
Now: 4.2:1

PyPI monthly downloads ratio, Jul 2023–Nov 2025


Section 04

Model Snapshot

Benchmarks of GPT-5.1, Claude Sonnet 4.5, GPT-5-Codex, Claude Opus 4.5, and Gemini 3 Pro, measuring how they behave as backends for coding agents across latency, throughput, rate limits, cold starts, cost, and tokenization efficiency.

Test Setup

Each model ran through the same six test suites with identical parameters:

Shared parameters
temperature = 0.2, top_p = 1.0, max_tokens = 1024
Exponential backoff on retryable errors (429, 5xx) with delays of 0.2s, 0.4s, and 0.8s
All models saw the same prompt set under the same protocol
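
As a rough illustration of this retry policy, the sketch below wraps a generic request function with the 0.2 s / 0.4 s / 0.8 s backoff schedule and the shared sampling parameters. The `send_request` callable is a placeholder, not any specific provider SDK.

```python
import time

# Backoff schedule from the shared parameters: three retries at 0.2s, 0.4s, 0.8s.
RETRY_DELAYS = [0.2, 0.4, 0.8]
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def call_with_backoff(send_request, prompt):
    """Call a provider endpoint, retrying on retryable HTTP errors.

    `send_request` is a hypothetical callable that returns an object with a
    `.status_code` attribute; swap in whichever client you actually use.
    """
    for delay in [0.0] + RETRY_DELAYS:
        if delay:
            time.sleep(delay)
        resp = send_request(
            prompt,
            temperature=0.2,   # shared parameters from the test setup
            top_p=1.0,
            max_tokens=1024,
        )
        if resp.status_code not in RETRYABLE_STATUS:
            return resp
    return resp  # give up after the final retry
```
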
01
Latency suite

100 streaming requests across 50 coding prompts. Measured p95 time-to-first-token (TTFT), p95 end-to-end completion time (E2E), and jitter (inter-chunk gaps). Three warmup requests preceded measurement.
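
A minimal sketch of how this suite can be run, assuming a hypothetical `stream_completion(prompt)` generator that yields response chunks. The percentile helper uses simple nearest-rank indexing rather than any particular statistics library.

```python
import time

def percentile(values, pct):
    """Nearest-rank percentile (e.g. pct=95 for p95)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

def run_latency_suite(stream_completion, prompts, runs=100, warmup=3):
    """p95 TTFT, p95 end-to-end time, and p95 inter-chunk jitter over `runs` requests.

    `stream_completion(prompt)` is a hypothetical generator yielding response chunks;
    swap in whichever streaming client you actually use.
    """
    for prompt in prompts[:warmup]:          # warmup requests, excluded from measurement
        for _ in stream_completion(prompt):
            pass

    ttfts, e2es, jitters = [], [], []
    for i in range(runs):
        prompt = prompts[i % len(prompts)]
        start = time.perf_counter()
        first, last = None, start
        for chunk in stream_completion(prompt):
            now = time.perf_counter()
            if first is None:
                first = now                  # time of the first streamed chunk
            else:
                jitters.append(now - last)   # gap between consecutive chunks
            last = now
        if first is None:                    # empty stream; skip this run
            continue
        ttfts.append(first - start)
        e2es.append(last - start)

    return {
        "ttft_p95": percentile(ttfts, 95),
        "e2e_p95": percentile(e2es, 95),
        "jitter_p95": percentile(jitters, 95),
    }
```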

02
Throughput suite

16 concurrent workers running for 60 seconds. Measured total completion tokens across all workers to estimate aggregate tokens per second at steady state.
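
A sketch of the throughput measurement under the same assumptions, with `complete(prompt)` standing in for a blocking call that returns the number of completion tokens in one response.

```python
import threading
import time

def run_throughput_suite(complete, prompts, workers=16, duration_s=60):
    """Estimate aggregate completion tokens per second at steady state.

    `complete(prompt)` is a hypothetical blocking call that returns the number of
    completion tokens in the response.
    """
    total_tokens = 0
    lock = threading.Lock()
    deadline = time.perf_counter() + duration_s

    def worker(offset):
        nonlocal total_tokens
        i = offset
        while time.perf_counter() < deadline:   # requests in flight at the deadline run to completion
            tokens = complete(prompts[i % len(prompts)])
            i += workers                        # each worker walks a disjoint slice of prompts
            with lock:
                total_tokens += tokens

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return total_tokens / duration_s            # aggregate tokens per second
```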

03
Rate limit suite

Concurrency ramped from 2 to 32 workers in stages. Test ended when error rate exceeded 2% (typically 429s). Reported tokens per minute at the last stage where the model handled load without degradation.
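
The ramp logic, sketched with a hypothetical `run_stage(workers)` helper that runs one fixed-length stage at a given concurrency and reports tokens per minute and error rate.

```python
def run_rate_limit_suite(run_stage, stages=(2, 4, 8, 16, 24, 32), max_error_rate=0.02):
    """Ramp concurrency in stages; report TPM at the last stage handled cleanly.

    `run_stage(workers)` is a hypothetical helper that runs one fixed-length stage
    at the given concurrency and returns (tokens_per_minute, error_rate).
    """
    best_tpm = None
    for workers in stages:
        tpm, error_rate = run_stage(workers)
        if error_rate > max_error_rate:   # degradation, typically a wall of 429s
            break
        best_tpm = tpm                    # last stage without degradation
    return best_tpm
```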

04
Cold start suite

10-minute idle period followed by 20 requests. Compared cold vs warm TTFT to see whether intermittent usage carries a latency penalty.
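
A sketch of the cold-start comparison, again assuming a hypothetical `measure_ttft(prompt)` helper that times a single streaming request.

```python
import time

def run_cold_start_suite(measure_ttft, prompt, idle_s=600, n_requests=20):
    """Compare cold vs. warm TTFT after an idle period.

    `measure_ttft(prompt)` is a hypothetical helper returning time-to-first-token
    in seconds for one streaming request.
    """
    time.sleep(idle_s)                                  # 10-minute idle period
    ttfts = [measure_ttft(prompt) for _ in range(n_requests)]
    cold = ttfts[0]                                     # first request after the idle gap
    warm = sum(ttfts[1:]) / (n_requests - 1)            # average of the remaining requests
    return {"cold_ttft": cold, "warm_ttft": warm, "penalty": cold - warm}
```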

05
Cost analysis

Synthetic workload of 8,000 prompt tokens and 1,000 completion tokens. Cost calculated from public provider pricing (not metered API bills).
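
The cost arithmetic for this synthetic workload is straightforward; the prices in the example below are placeholders, not any provider's actual rates.

```python
def workload_cost(price_per_m_input, price_per_m_output,
                  prompt_tokens=8_000, completion_tokens=1_000):
    """Cost of one synthetic request given per-million-token prices (USD)."""
    return (prompt_tokens / 1_000_000) * price_per_m_input \
         + (completion_tokens / 1_000_000) * price_per_m_output

# Placeholder prices, not any provider's actual rates:
# 8,000 prompt tokens at $3/M plus 1,000 completion tokens at $15/M.
print(round(workload_cost(price_per_m_input=3.00, price_per_m_output=15.00), 4))  # 0.039
```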

06
Tokenization efficiency

Mixed-language code blob measured as characters per token. Higher ratios mean more code fits into a fixed context window.
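
To make the metric concrete, the sketch below computes characters per token using tiktoken's cl100k_base encoding as a stand-in; the benchmarked models use their own tokenizers, so treat this only as an illustration of the calculation.

```python
import tiktoken  # pip install tiktoken

def chars_per_token(code: str, encoding_name: str = "cl100k_base") -> float:
    """Characters per token for a code blob; higher means denser packing."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(code) / len(encoding.encode(code))

sample = "def add(a: int, b: int) -> int:\n    return a + b\n"
print(round(chars_per_token(sample), 1))
```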

Results Overview

A comprehensive comparison of all models across key performance metrics.

Model         | Provider  | TTFT p25 | TTFT p50 | TTFT p75 | Throughput p25 | Throughput p50 | Throughput p75
GPT-5-Codex   | OpenAI    | 3.7 s    | 5.0 s    | 6.6 s    | 53 tok/s       | 62 tok/s       | 73 tok/s
GPT-5.1       | OpenAI    | 3.9 s    | 5.5 s    | 7.6 s    | 55 tok/s       | 62 tok/s       | 68 tok/s
Sonnet 4.5    | Anthropic | 1.8 s    | 2.0 s    | 2.2 s    | 17 tok/s       | 19 tok/s       | 21 tok/s
Opus 4.5      | Anthropic | 1.9 s    | 2.2 s    | 3.0 s    | 14 tok/s       | 18 tok/s       | 20 tok/s
Gemini 3 Pro  | Google    | 11.8 s   | 13.1 s   | 14.5 s   | 4 tok/s        | 4 tok/s        | 5 tok/s

Latency

Anthropic's Opus 4.5 and Sonnet 4.5 sit in a different latency class, finishing full responses several seconds ahead of any OpenAI or Google model in this set. In practice that means multi-step agent chains turn over faster and human reviewers stay in flow, instead of burning an extra 10+ seconds per hop waiting for GPT-5 or Gemini to finish streaming.
Latency Distribution

Throughput vs Rate Limit

OpenAI's GPT-5-Codex and GPT-5.1 are the only models that push up against their rate limits. Their throughput climbs closer to the quota, meaning you can keep far more coding agents or CI jobs running in parallel before throttling, while Anthropic and Gemini hit backoff sooner and require stricter queueing.
Realized Throughput vs Rate-Limit Ceiling

Cost & Tokenization

The key patterns are the multipliers, not the absolute prices:

Cost Multiplier
Relative cost of the synthetic workload, from GPT-5, Codex & Gemini through Sonnet 4.5 up to Opus 4.5 at roughly 10×.

Additional Behavioral Observations

A few model behaviors aren't obvious from the main table but showed up in the suites:

Cold Start Behavior
+2.385 s: maximum cold-start penalty (Gemini 3 Pro)
< 40 ms: cold vs. warm TTFT gap for all other models (effectively no penalty)

Anthropic Tier Consistency
16.3 chars/token: Sonnet & Opus code tokenization (identical density)
≈147k tokens/min: average Anthropic rate limit (Sonnet & Opus)
0.0%: error rate in our tests

OpenAI Utilization Under Load
≈23% of quota: average OpenAI throughput with only 16 workers
≈11–14% of quota: Anthropic models at the same settings

Gemini Jitter
≈8 ms: Gemini 3 Pro p95 jitter (near-perfectly even token spacing)
≈100–320 ms: p95 jitter range for the other models

These details may matter for specific environments (e.g., systems that are sensitive to jitter or cold-start behavior).


Section 05

Research & Content

Surfacing recent research that shaped how 2025 tools handle scale, context, and agents, so teams can interpret the findings and apply them to their own systems.

Foundational Model Advances

DeepSeek-V3 Technical Report

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model that activates only 37B parameters per token, focusing on efficiency rather than raw size. The report shows how architectural choices can narrow the gap with much larger dense models.

Multi-Head Latent Attention compresses key/value representations into small latent vectors, shrinking KV caches and easing memory pressure.
Sparse MoE routing activates only a few experts per token and limits cross-node communication to keep GPUs fully utilized.
Multi-Token Prediction adds auxiliary targets per token, increasing learning signal density during training.
Overall, the model treats scale as a data-flow and memory-management problem, not just a parameter-count problem.

Qwen2.5-Omni Technical Report

Qwen2.5-Omni is a multimodal model that separates perception (audio/vision encoders) from sequence modeling (a shared language model), with an emphasis on stable, real-time text–audio–video reasoning.

Time-aligned Multimodal RoPE (TMRoPE) synchronizes audio and video via consistent temporal position embeddings.
Encoders process inputs in blocks, while a central language model handles long-range reasoning and context.
A Thinker–Talker architecture splits responsibilities: Thinker does text reasoning, Talker turns internal representations into streaming speech.
The design highlights that decoupling perception, reasoning, and generation can make multimodal systems easier to scale and debug.

Long Context vs. RAG for LLMs: An Evaluation and Revisits

This paper systematically compares long-context (LC) models and RAG across 12 QA datasets and ~19k questions, focusing on how each approach handles external information.

LC tends to outperform RAG on continuous, well-structured sources (books, wiki articles) and precise fact-based questions.
RAG tends to win on fragmented, multi-source, and dialogue-heavy data, especially under loose F1-style scoring.
Summarization-based retrieval performs similarly to LC, while simple chunk-based retrieval falls behind.
The core claim: LC and RAG succeed under different structural assumptions about where relevant information lives.

Does RAG Really Perform Bad for Long Context?

RetroLM introduces KV-level retrieval for long-context tasks, treating the KV cache as the retrieval surface instead of raw text.

Inputs are split into fixed-size KV "pages" with bookmark tokens summarizing each page.
A trained page retriever selects important KV pages per layer; offloaded pages live off-device and are pulled back on demand.
Across LongBench, InfiniteBench, and RULER, RetroLM beats standard RAG pipelines and other efficient long-context methods.
The framework reframes retrieval as selecting which cached representations to keep, rather than which raw tokens to stuff into the prompt.

Rethinking Mixture-of-Agents

Self-MoA examines whether diverse model ensembles are actually necessary for strong Mixture-of-Agents performance.

Standard MoA queries multiple different models and aggregates their answers; Self-MoA instead repeatedly samples a single strong model.
An aggregator LLM combines multiple responses from that one model, trading cross-model diversity for in-model diversity.
Experiments on AlpacaEval 2.0 and other benchmarks show Self-MoA outperforming traditional MoA when proposer quality is high.
A sequential variant, Self-MoA-Seq, aggregates in sliding windows to stay within context limits while scaling the number of samples.
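
A minimal sketch of the Self-MoA loop described above, with `generate` and `aggregate` as placeholder calls to the same underlying model (the second acting as the aggregator); this is an illustration of the sampling-and-aggregation idea, not the paper's implementation.

```python
def self_moa(generate, aggregate, prompt, n_samples=4):
    """Sample one strong model several times, then aggregate its own responses.

    `generate(prompt)` and `aggregate(prompt)` are hypothetical calls to the same
    underlying model.
    """
    candidates = [generate(prompt) for _ in range(n_samples)]   # in-model diversity
    numbered = "\n\n".join(f"Response {i + 1}:\n{c}" for i, c in enumerate(candidates))
    aggregation_prompt = (
        f"Task:\n{prompt}\n\n"
        f"Candidate responses from the same model:\n{numbered}\n\n"
        "Synthesize the best single response."
    )
    return aggregate(aggregation_prompt)
```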

Application-Layer Innovations

GEPA: Reflective Prompt Evolution Can Outperform RL

GEPA (Genetic-Pareto) is a reflective prompt-evolution method that optimizes instructions using execution traces instead of updating model weights.

The system samples rollouts, has the model analyze its own traces in natural language, and proposes new prompts.
A Pareto front keeps multiple candidate prompts that perform well on different subsets of data.
Across four tasks, GEPA matches or beats GRPO-style RL with up to 35× fewer rollouts.
The work treats prompts as an external optimization layer, using natural-language reflection rather than heavyweight RL.
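
The Pareto-front bookkeeping is the easiest piece to make concrete. The sketch below keeps every candidate prompt that is not dominated across per-subset scores; it omits the reflective mutation step that actually proposes new prompts, and the names are illustrative only.

```python
def pareto_front(candidates):
    """Keep prompts that are not dominated on every data subset.

    `candidates` maps a prompt string to a tuple of scores, one per subset.
    A prompt is dominated if another scores >= on every subset and > on at least one.
    """
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    front = {}
    for prompt, scores in candidates.items():
        if not any(dominates(other, scores)
                   for p, other in candidates.items() if p != prompt):
            front[prompt] = scores
    return front

# Example with two task subsets: the first two prompts survive, the third is dominated.
print(pareto_front({
    "prompt A": (0.9, 0.4),
    "prompt B": (0.5, 0.8),
    "prompt C": (0.4, 0.3),
}))
```
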
SFR-DeepResearch: Single-Agent RL for Deep Web Research

SFR-DeepResearch (SFR-DR) is a reinforcement-learning framework for training a single web-research agent that decides when to search, browse, or execute code.

The agent uses three minimal tools—search_internet, browse_page, stateless code_interpreter—designed to force explicit reasoning.
A self-managed memory tool (clean_memory) lets the agent control long-horizon context instead of passively appending everything.
Length-normalized RL stabilizes multi-step optimization and prevents degenerate, repetitive tool use.
Results on Humanity's Last Exam and related benchmarks highlight that context management and planning are the core bottlenecks, not just model size.

Beyond RAG vs Long-Context

LDAR (Learning Distraction-Aware Retrieval) targets the performance drop that occurs when relevant passages are mixed with noisy context.

A small Transformer operates purely on similarity-score distributions of candidate passages, predicting a lower and upper similarity bound.
Retrieval becomes selecting a continuous "band" of passages, instead of top-k or independent Bernoulli decisions.
LDAR uses 25–63% of the tokens of long-context baselines while maintaining or improving performance.
The central claim is that context quality and distraction-awareness matter more than raw context size, especially on noisy corpora.
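
A toy illustration of band selection versus top-k, where `lower` and `upper` stand in for the bounds the paper's small Transformer predicts from the score distribution; the scores and names here are made up for the example.

```python
def select_band(passages, scores, lower, upper):
    """Select the 'band' of passages whose similarity falls within [lower, upper]."""
    return [p for p, s in zip(passages, scores) if lower <= s <= upper]

# Unlike top-k, a band can also exclude passages above the upper bound.
passages = ["p1", "p2", "p3", "p4"]
scores   = [0.97, 0.81, 0.74, 0.40]
print(select_band(passages, scores, lower=0.70, upper=0.90))  # -> ['p2', 'p3']
```
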
MEM1: Constant-Memory Long-Horizon Agents

MEM1 is an RL framework that trains LLM agents to operate over long multi-turn tasks while keeping memory usage nearly constant.

At each step, previous memory and new observations are merged into a compact internal state token (<IS>), and older context is discarded.
A masked-trajectory RL scheme reconstructs valid trajectories for PPO without feeding the entire history.
MEM1-7B matches or beats much larger baselines on tasks with up to 16 sequential objectives while reducing memory use by ~3.7×.
The work shows that long-horizon behavior can come from learned internal state handling rather than expanding context windows or bolting on external memory.

Search-R1: Training LLMs to Reason and Search with RL

Search-R1 trains models to interleave step-by-step reasoning with live search-engine queries.

The framework uses a structured template: <think> for internal reasoning, <search> for queries, <information> for retrieved context, and <answer> for final output.
PPO or GRPO updates apply only to model-generated segments, treating the search engine as part of the environment.
Evaluated across seven QA datasets, Search-R1 delivers large gains over strong RAG baselines, including on multi-hop tasks like HotpotQA and 2WikiMultiHopQA.
The paper positions targeted, RL-trained search behavior as an alternative to static top-k retrieval and hand-crafted tool chains.
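
A rough sketch of the trajectory template and the masking idea, using the paper's tag names; the span helper below is illustrative only and is not the paper's training code.

```python
import re

def build_rollout(question, steps, answer):
    """Assemble a Search-R1-style trajectory from (think, query, retrieved) steps."""
    parts = [f"Question: {question}"]
    for think, query, retrieved in steps:
        parts.append(f"<think>{think}</think>")
        parts.append(f"<search>{query}</search>")
        parts.append(f"<information>{retrieved}</information>")   # supplied by the environment
    parts.append(f"<answer>{answer}</answer>")
    return "\n".join(parts)

def trainable_spans(rollout):
    """Character spans outside <information> blocks: the model-generated segments
    that policy updates would apply to, with retrieved text masked out."""
    masked = [(m.start(), m.end())
              for m in re.finditer(r"<information>.*?</information>", rollout, re.S)]
    spans, cursor = [], 0
    for start, end in masked:
        if cursor < start:
            spans.append((cursor, start))
        cursor = end
    if cursor < len(rollout):
        spans.append((cursor, len(rollout)))
    return spans
```
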
Greptile

Automatically review PRs with your team's standards