AI CODING REPORT
A cross-industry study on recent trends in AI software development.
Navigate the Report
Engineering Team Velocity
Measuring productivity gains across development workflows
AI Tool Adoption
Tracking the rise of AI-powered development tools
Model Growth Trends
How AI models have scaled and evolved
Model Snapshot
Performance benchmarks across latency, cost, and tokenization
Research & Content
Recent papers on foundational models and applications
Engineering Team Velocity
Measuring productivity gains across development workflows.
Median PR size increased 33% from March to November 2025, rising from 57 to 76 lines changed per PR.
Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.
Medium teams (6-15 devs) increased output from 7,005 to 13,227 lines per developer.
Median lines changed per file grew from 18 to 22 as PRs become denser.
AI Tool Adoption
Tracking the rise of AI-powered development tools.
mem0 dominates with 59% market share. The clear leader in AI memory infrastructure.
PyPI + npm monthly downloads, Nov 2025
No clear winner. Weaviate leads at 25%, but 6 players sit between 10-25% share.
PyPI + npm monthly downloads, Nov 2025
CLAUDE.md leads adoption at 67%. Most teams use multiple formats.
17% of repos use all three formats
Anthropic SDK leads at 43M (8x growth). Pydantic AI explodes 3.7x to 6M.
PyPI + npm monthly downloads, Apr–Nov 2025
LangSmith dominates at 110M. Helicone included for comparison at 5.5K.
PyPI + npm monthly downloads, Jun–Nov 2025
LangSmith is bundled with LangChain installs
Model Growth Trends
How AI models have scaled and evolved.
OpenAI leads at 130M. Anthropic grew 1,547x since Apr 2023. Google trails at 13.6M.
PyPI monthly downloads, Jan 2022–Nov 2025
OpenAI-to-Anthropic ratio dropped from 47:1 (Jan 2024) to 4.2:1 (Nov 2025).
PyPI monthly downloads ratio, Jul 2023–Nov 2025
Model Snapshot
Model benchmarks for GPT-5.1, Claude Sonnet 4.5, GPT-5-Codex, Claude Opus 4.5, and Gemini 3 Pro to understand how they behave as backends for coding agents across latency, throughput, rate limits, cold starts, cost, and tokenization efficiency.
Test Setup
Each model ran through the same six test suites with identical parameters:
Latency suite
100 streaming requests across 50 coding prompts. Measured p95 time-to-first-token (TTFT), p95 end-to-end completion time (E2E), and jitter (inter-chunk gaps). Three warmup requests preceded measurement.
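As a rough illustration of how TTFT, E2E, and jitter can be captured for a single streaming request, here is a minimal Python sketch. It assumes the official OpenAI Python SDK and a placeholder model identifier ("gpt-5.1"); the report does not publish its harness, so treat this as an approximation of the method, not the actual code.

```python
import time
from statistics import quantiles

from openai import OpenAI  # assumes the official OpenAI Python SDK; other providers differ

client = OpenAI()

def measure_one(prompt: str, model: str = "gpt-5.1") -> tuple[float, float, float]:
    """Return (TTFT, E2E, max inter-chunk gap) in seconds for one streaming request."""
    start = time.perf_counter()
    first_token_at = None
    last_chunk_at = start
    max_gap = 0.0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        now = time.perf_counter()
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = now                      # first visible token -> TTFT
            max_gap = max(max_gap, now - last_chunk_at)   # inter-chunk gap (jitter proxy)
            last_chunk_at = now
    end = time.perf_counter()
    return (first_token_at or end) - start, end - start, max_gap

def p95(samples: list[float]) -> float:
    """95th-percentile cut point across a batch of measurements."""
    return quantiles(samples, n=20)[18]
```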
Throughput suite
16 concurrent workers running for 60 seconds. Measured total completion tokens across all workers to estimate aggregate tokens per second at steady state.
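The throughput setup can be approximated with asyncio: 16 workers issuing back-to-back requests for 60 seconds, then dividing the summed completion tokens by the wall-clock duration. Again a sketch assuming the OpenAI Python SDK and a placeholder model name, not the report's harness.

```python
import asyncio
import time

from openai import AsyncOpenAI  # assumes the official OpenAI Python SDK

client = AsyncOpenAI()
WORKERS, DURATION_S = 16, 60

async def worker(prompt: str, model: str, deadline: float) -> int:
    """Issue back-to-back requests until the deadline; return completion tokens produced."""
    tokens = 0
    while time.perf_counter() < deadline:
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        tokens += resp.usage.completion_tokens
    return tokens

async def aggregate_tps(prompt: str, model: str = "gpt-5.1") -> float:
    """Aggregate tokens per second across all workers at steady state."""
    deadline = time.perf_counter() + DURATION_S
    totals = await asyncio.gather(*(worker(prompt, model, deadline) for _ in range(WORKERS)))
    return sum(totals) / DURATION_S

# Example: asyncio.run(aggregate_tps("Write a binary search in Python."))
```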
Rate limit suite
Concurrency ramped from 2 to 32 workers in stages. Test ended when error rate exceeded 2% (typically 429s). Reported tokens per minute at the last stage where the model handled load without degradation.
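A hedged sketch of the ramp logic: run fixed-length stages at increasing concurrency, track the error rate (429s are the typical failure), stop once it passes 2%, and report tokens per minute from the last clean stage. The per-stage duration (STAGE_S) and model identifier are assumptions; the report does not state them.

```python
import asyncio
import time

import openai
from openai import AsyncOpenAI  # assumes the official OpenAI Python SDK

client = AsyncOpenAI()
STAGES = [2, 4, 8, 16, 24, 32]   # concurrency ramp
STAGE_S = 30                      # assumption: per-stage duration is not stated in the report

async def one_request(prompt: str, model: str) -> tuple[int, bool]:
    """Return (completion tokens, success flag) for a single request."""
    try:
        resp = await client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return resp.usage.completion_tokens, True
    except openai.APIStatusError:   # rate-limit responses (HTTP 429) are the typical failure
        return 0, False

async def run_stage(prompt: str, model: str, workers: int) -> tuple[int, float]:
    deadline = time.perf_counter() + STAGE_S
    tokens, failures, total = 0, 0, 0

    async def worker() -> None:
        nonlocal tokens, failures, total
        while time.perf_counter() < deadline:
            t, ok = await one_request(prompt, model)
            tokens, failures, total = tokens + t, failures + (not ok), total + 1

    await asyncio.gather(*(worker() for _ in range(workers)))
    return tokens, failures / max(total, 1)

async def ramp(prompt: str, model: str = "gpt-5.1") -> float:
    tpm = 0.0
    for workers in STAGES:
        stage_tokens, error_rate = await run_stage(prompt, model, workers)
        if error_rate > 0.02:                 # stop once more than 2% of requests fail
            break
        tpm = stage_tokens * 60 / STAGE_S     # tokens per minute at the last clean stage
    return tpm
```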
Cold start suite
10-minute idle period followed by 20 requests. Compared cold vs warm TTFT to see whether intermittent usage carries a latency penalty.
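The cold-start comparison reduces to measuring TTFT once after a long idle period and again once the connection and any provider-side caches are warm. A minimal sketch, under the same SDK and model-name assumptions as above:

```python
import statistics
import time

from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()

def ttft(prompt: str, model: str = "gpt-5.1") -> float:
    """Seconds until the first streamed content chunk arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], stream=True)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start

def cold_vs_warm(prompt: str, idle_s: int = 600, n: int = 20) -> tuple[float, float]:
    time.sleep(idle_s)                                  # 10-minute idle period
    samples = [ttft(prompt) for _ in range(n)]
    return samples[0], statistics.median(samples[1:])   # cold TTFT vs warm median
```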
Cost analysis
Synthetic workload of 8,000 prompt tokens and 1,000 completion tokens. Cost calculated from public provider pricing (not metered API bills).
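The cost math itself is a weighted sum of prompt and completion tokens. The sketch below uses placeholder per-million-token prices (not the report's figures) to show the calculation and the "cost multiplier" normalization relative to the cheapest model:

```python
# Placeholder per-million-token prices (NOT the report's figures); substitute
# current public provider pricing before drawing conclusions.
PRICES_PER_MTOK = {          # model -> (input price, output price) in USD per 1M tokens
    "model-a": (1.25, 10.00),    # hypothetical
    "model-b": (3.00, 15.00),    # hypothetical
}
PROMPT_TOKENS, COMPLETION_TOKENS = 8_000, 1_000   # synthetic workload from the report

def workload_cost(model: str) -> float:
    """Cost in USD of one synthetic request under the placeholder price table."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (PROMPT_TOKENS * input_price + COMPLETION_TOKENS * output_price) / 1_000_000

costs = {m: workload_cost(m) for m in PRICES_PER_MTOK}
cheapest = min(costs.values())
multipliers = {m: round(c / cheapest, 2) for m, c in costs.items()}  # cost multiplier vs cheapest
```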
Tokenization efficiency
Mixed-language code blob measured as characters per token. Higher ratios mean more code fits into a fixed context window.
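Characters per token can be measured directly with a tokenizer. This sketch uses tiktoken (OpenAI's tokenizer library) purely as an example; in practice each provider's own tokenizer, or the token counts its API returns, would be needed for a fair cross-model comparison.

```python
import tiktoken  # OpenAI's tokenizer library, used here only as an example

def chars_per_token(code_blob: str, encoding_name: str = "o200k_base") -> float:
    """Characters per token; higher means more code fits into a fixed context window."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(code_blob) / len(enc.encode(code_blob))

sample = "def add(a: int, b: int) -> int:\n    return a + b\n"
print(f"{chars_per_token(sample):.2f} chars/token")
```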
Results Overview
A comprehensive comparison of all models across key performance metrics.
| Model | Provider | TTFT p25 | TTFT p50 | TTFT p75 | Throughput p25 | Throughput p50 | Throughput p75 |
|---|---|---|---|---|---|---|---|
| GPT-5-Codex | OpenAI | 3.7 s | 5.0 s | 6.6 s | 53 tok/s | 62 tok/s | 73 tok/s |
| GPT-5.1 | OpenAI | 3.9 s | 5.5 s | 7.6 s | 55 tok/s | 62 tok/s | 68 tok/s |
| Sonnet 4.5 | Anthropic | 1.8 s | 2.0 s | 2.2 s | 17 tok/s | 19 tok/s | 21 tok/s |
| Opus 4.5 | Anthropic | 1.9 s | 2.2 s | 3.0 s | 14 tok/s | 18 tok/s | 20 tok/s |
| Gemini 3 Pro | Google | 11.8 s | 13.1 s | 14.5 s | 4 tok/s | 4 tok/s | 5 tok/s |
Latency
Throughput vs Rate Limit
Cost & Tokenization
The key patterns are the multipliers, not the absolute prices:
Cost Multiplier
Additional Behavioral Observations
A few model behaviors aren't obvious from the main table but showed up in the suites:
These details may matter for specific environments (e.g., systems that are sensitive to jitter or cold-start behavior).
Research & Content
Surfacing recent research that shaped how 2025 tools handle scale, context, and agents, so teams can interpret the findings and apply them to their own systems.
Foundational Model Advances
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model that activates only 37B parameters per token, focusing on efficiency rather than raw size. The report shows how architectural choices can narrow the gap with much larger dense models.
Qwen2.5-Omni is a multimodal model that separates perception (audio/vision encoders) from sequence modeling (a shared language model), with an emphasis on stable, real-time text–audio–video reasoning.
This paper systematically compares long-context (LC) models and RAG across 12 QA datasets and ~19k questions, focusing on how each approach handles external information.
RetroLM introduces KV-level retrieval for long-context tasks, treating the KV cache as the retrieval surface instead of raw text.
Self-MoA examines whether diverse model ensembles are actually necessary for strong Mixture-of-Agents performance.
Application-Layer Innovations
GEPA (Genetic-Pareto) is a reflective prompt-evolution method that optimizes instructions using execution traces instead of updating model weights.
SFR-DeepResearch (SFR-DR) is a reinforcement-learning framework for training a single web-research agent that decides when to search, browse, or execute code.
LDAR (Learning Distraction-Aware Retrieval) targets the performance drop that occurs when relevant passages are mixed with noisy context.
MEM1 is an RL framework that trains LLM agents to operate over long multi-turn tasks while keeping memory usage nearly constant.
At each turn, the agent consolidates its reasoning and new observations into a compact internal state (<IS>), and older context is discarded.
Search-R1 trains models to interleave step-by-step reasoning with live search-engine queries.
It structures generation with dedicated tags: <think> for internal reasoning, <search> for queries, <information> for retrieved context, and <answer> for final output.