The State of AI Coding
A cross-industry study on recent trends in AI software development.
Navigate the Report
Engineering Team Velocity
Measuring productivity gains across development workflows
AI Tool Adoption
Tracking the rise of AI-powered development tools
Model Growth Trends
How AI models have scaled and evolved
Model Snapshot
Performance benchmarks across latency, cost, and tokenization
Research & Content
Recent papers on foundational models and applications
Engineering Team Velocity
Measuring productivity gains across development workflows.
Median PR size increased 33% from March to November 2025, rising from 57 to 76 lines changed per PR.
Captured from Greptile internal engineering team velocity data
Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.
Captured from Greptile internal engineering team velocity data
Medium teams (6-15 devs) increased output from 7,005 to 13,227 lines per developer.
Captured from Greptile internal engineering team velocity data
Median lines changed per file grew from 18 to 22 as PRs became denser.
Captured from Greptile internal engineering team velocity data
AI Tool Adoption
Tracking the rise of AI-powered development tools.
mem0 dominates with 59% market share. The clear leader in AI memory infrastructure.
PyPI + npm monthly downloads, Nov 2025
No clear winner. Weaviate leads at 25%, but six players sit between 10% and 25% share.
PyPI + npm monthly downloads, Nov 2025
CLAUDE.md leads adoption at 67%. Most teams use multiple formats.
17% of repos use all three formats
Anthropic SDK leads at 43M (8x growth). Pydantic AI explodes 3.7x to 6M.
PyPI + npm monthly downloads, Apr–Nov 2025
LangSmith dominates at 110M monthly downloads.
PyPI + npm monthly downloads, Jun–Nov 2025
LangSmith is bundled with LangChain installs
Model Growth Trends
How AI models have scaled and evolved.
OpenAI leads at 130M. Anthropic grew 1,547x since Apr 2023. Google trails at 13.6M.
PyPI monthly downloads, Jan 2022–Nov 2025
OpenAI-to-Anthropic ratio dropped from 47:1 (Jan 2024) to 4.2:1 (Nov 2025).
PyPI monthly downloads ratio, Jul 2023–Nov 2025
Model Snapshot
We benchmarked GPT-5.1, Claude Sonnet 4.5, GPT-5-Codex, Claude Opus 4.5, and Gemini 3 Pro to understand how they behave as backends for coding agents across latency, throughput, rate limits, cold starts, cost, and tokenization efficiency.
Test Setup
Each model ran through the same six test suites with identical parameters:
TTFT suite
Measured time-to-first-token (TTFT) distribution across requests, reporting p25/p50/p75 percentiles. Three warmup requests preceded measurement.
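For reference, here is a minimal sketch of how such a TTFT harness can be wired up, assuming the OpenAI Python SDK's streaming chat API; the model name, prompt, and run counts are placeholders, not the exact harness behind these numbers.

```python
# Minimal TTFT measurement sketch. Assumes the OpenAI Python SDK's streaming
# chat completions API; model, prompt, and run counts are placeholders.
import time
import statistics
from openai import OpenAI

client = OpenAI()

def measure_ttft(model: str, prompt: str, runs: int = 20, warmups: int = 3) -> dict:
    samples = []
    for i in range(warmups + runs):
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            # The first chunk that carries content marks time-to-first-token.
            if chunk.choices and chunk.choices[0].delta.content:
                break
        elapsed = time.perf_counter() - start
        if i >= warmups:  # drop warmup requests, as in the suite
            samples.append(elapsed)
    p25, p50, p75 = statistics.quantiles(samples, n=4)  # quartile cut points
    return {"p25": p25, "p50": p50, "p75": p75}
```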
Throughput suite
Measured aggregate tokens per second, reporting p25/p50/p75 percentiles across test runs.
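A companion sketch for the throughput suite under the same placeholder assumptions; it uses the API's reported completion-token count, and a streaming variant could exclude TTFT from the denominator instead.

```python
# Throughput measurement sketch: tokens per second per request, summarized
# with the same p25/p50/p75 quartiles. Same placeholder assumptions as above.
import time
import statistics
from openai import OpenAI

client = OpenAI()

def measure_throughput(model: str, prompt: str, runs: int = 20) -> dict:
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.perf_counter() - start
        # tokens/s over the whole request, using the provider-reported count.
        rates.append(resp.usage.completion_tokens / elapsed)
    p25, p50, p75 = statistics.quantiles(rates, n=4)
    return {"p25": p25, "p50": p50, "p75": p75}
```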
Results Overview
A comprehensive comparison of all models across key performance metrics.
| Model | Provider | TTFT p25 | TTFT p50 | TTFT p75 | Throughput p25 | Throughput p50 | Throughput p75 |
|---|---|---|---|---|---|---|---|
| GPT-5-Codex | OpenAI | 3.7 s | 5.0 s | 6.6 s | 53 tok/s | 62 tok/s | 73 tok/s |
| GPT-5.1 | OpenAI | 3.9 s | 5.5 s | 7.6 s | 55 tok/s | 62 tok/s | 68 tok/s |
| Sonnet 4.5 | Anthropic | 1.8 s | 2.0 s | 2.2 s | 17 tok/s | 19 tok/s | 21 tok/s |
| Opus 4.5 | Anthropic | 1.9 s | 2.2 s | 3.0 s | 14 tok/s | 18 tok/s | 20 tok/s |
| Gemini 3 Pro | Google | 11.8 s | 13.1 s | 14.5 s | 4 tok/s | 4 tok/s | 5 tok/s |
Time to First Token (TTFT)
Generation Throughput
Cost Multipliers
The key pattern is the multipliers, not the absolute prices. Calculated using public list pricing as of December 15, 2025 for an 8k input / 1k output workload, normalized to GPT-5-Codex = 1× (no caching/batch discounts).
Cost Multiplier
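To make the normalization concrete, here is a tiny sketch of the arithmetic; the per-million-token prices below are illustrative placeholders, not the December 15, 2025 list prices behind the chart.

```python
# Cost-multiplier arithmetic (sketch). Prices are illustrative placeholders
# in USD per 1M tokens, NOT actual list prices.
PRICES = {
    "gpt-5-codex": {"input": 1.25, "output": 10.00},  # baseline (1x)
    "model-x":     {"input": 3.00, "output": 15.00},  # hypothetical comparison
}

def workload_cost(model: str, in_tok: int = 8_000, out_tok: int = 1_000) -> float:
    p = PRICES[model]
    return (in_tok * p["input"] + out_tok * p["output"]) / 1_000_000

baseline = workload_cost("gpt-5-codex")
for model in PRICES:
    print(f"{model}: {workload_cost(model) / baseline:.2f}x")
```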
Research & Content
Surfacing recent research that shaped how 2025 tools handle scale, context, and agents, so teams can interpret the findings and apply them to their own systems.
Foundational Model Advances
DeepSeek-V3 Technical Report
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model that activates only 37B parameters per token, focusing on efficiency rather than raw size. The report shows how architectural choices can narrow the gap with much larger dense models.
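To make sparse activation concrete, here is a toy top-k routed MoE layer in PyTorch; the dimensions, expert count, and top-k value are illustrative, not DeepSeek-V3's actual configuration.

```python
# Toy Mixture-of-Experts layer: a router scores experts per token and only
# the top-k run, so a small fraction of total parameters is active per token.
# Sizes here are illustrative, not DeepSeek-V3's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # only the selected experts do any compute
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```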
Qwen2.5-Omni Technical Report
Qwen2.5-Omni is a multimodal model that separates perception (audio/vision encoders) from sequence modeling (a shared language model), with an emphasis on stable, real-time text–audio–video reasoning.
Long Context vs. RAG for LLMs: An Evaluation and Revisits
This paper systematically compares long-context (LC) models and RAG across 12 QA datasets and ~19k questions, focusing on how each approach handles external information.
Does RAG Really Perform Bad for Long Context?
RetroLM introduces KV-level retrieval for long-context tasks, treating the KV cache as the retrieval surface instead of raw text.
Rethinking Mixture-of-Agents
Self-MoA examines whether diverse model ensembles are actually necessary for strong Mixture-of-Agents performance.
Application-Layer Innovations
GEPA: Reflective Prompt Evolution Can Outperform RL
GEPA (Genetic-Pareto) is a reflective prompt-evolution method that optimizes instructions using execution traces instead of updating model weights.
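A highly simplified sketch of that loop follows; `evaluate` and `reflect` are hypothetical stand-ins for running a prompt against tasks (returning per-task scores and execution traces) and for LLM-driven mutation of the prompt text, not GEPA's actual implementation.

```python
# Toy reflective prompt-evolution loop in the spirit of GEPA (sketch).
# `evaluate(prompt) -> (scores, traces)` and `reflect(prompt, traces) -> prompt`
# are hypothetical stand-ins; no model weights are ever updated.
import random

def pareto_front(pool):
    # Keep candidates not dominated on every per-task score by another candidate.
    return [
        (cand, scores) for cand, scores in pool
        if not any(
            all(o >= s for o, s in zip(other, scores)) and other != scores
            for _, other in pool
        )
    ]

def evolve(seed_prompt, evaluate, reflect, generations: int = 10):
    pool = [(seed_prompt, evaluate(seed_prompt)[0])]
    for _ in range(generations):
        parent, _ = random.choice(pareto_front(pool))  # genetic selection
        _, traces = evaluate(parent)
        child = reflect(parent, traces)                # textual mutation
        pool.append((child, evaluate(child)[0]))
    return max(pareto_front(pool), key=lambda entry: sum(entry[1]))
```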
SFR-DeepResearch: Single-Agent RL for Deep Web Research
SFR-DeepResearch (SFR-DR) is a reinforcement-learning framework for training a single web-research agent that decides when to search, browse, or execute code.
Beyond RAG vs Long-Context
LDAR (Learning Distraction-Aware Retrieval) targets the performance drop that occurs when relevant passages are mixed with noisy context.
MEM1: Constant-Memory Long-Horizon Agents
MEM1 is an RL framework that trains LLM agents to operate over long multi-turn tasks while keeping memory usage nearly constant: at each turn, the agent consolidates its reasoning and new observations into a compact internal state (<IS>), and older context is discarded.
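Schematically, the agent loop looks like the sketch below; `act` and `consolidate` are hypothetical stand-ins for the trained policy's action step and its state-consolidation step.

```python
# Constant-memory agent loop in the spirit of MEM1 (sketch). `act` and
# `consolidate` are hypothetical stand-ins for the learned policy.
def run_episode(task: str, act, consolidate, max_turns: int = 20):
    internal_state = ""                      # the <IS> block, bounded in size
    for _ in range(max_turns):
        action, observation = act(task, internal_state)
        if action == "answer":
            return observation
        # Merge the new observation into <IS>; older raw context is discarded,
        # so the prompt footprint stays roughly constant across turns.
        internal_state = consolidate(internal_state, observation)
    return None
```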
Search-R1: Training LLMs to Reason and Search with RL
Search-R1 trains models to interleave step-by-step reasoning with live search-engine queries.
It structures each turn with four tags: <think> for internal reasoning, <search> for queries, <information> for retrieved context, and <answer> for final output.
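A schematic driver loop for that tag protocol is sketched below; `generate` (continuing the LLM transcript) and `search_engine` (returning retrieved text) are hypothetical stand-ins.

```python
# Schematic rollout loop for the Search-R1 tag protocol (sketch).
# `generate` and `search_engine` are hypothetical stand-ins.
import re

SEARCH = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rollout(question: str, generate, search_engine, max_turns: int = 8):
    transcript = question
    for _ in range(max_turns):
        step = generate(transcript)          # emits <think>/<search>/<answer>
        transcript += step
        if (m := ANSWER.search(step)):
            return m.group(1).strip()        # final output
        if (m := SEARCH.search(step)):
            results = search_engine(m.group(1).strip())
            # Retrieved context goes back into the transcript as <information>.
            transcript += f"<information>{results}</information>"
    return None
```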