The State of AI Coding

A cross-industry study on recent trends in AI software development.
Q1 2026 update.

Table of Contents


Section 01

Engineering Team Velocity

Measuring productivity gains across development workflows.

Chart 1.1
PRs Are Getting Bigger

Median PR size increased 93% from March 2025 to March 2026, rising from 57 to 110 lines changed per PR.

Greptile internal data, engineering team velocity

Chart 1.2
Developer Output

Lines of code per developer grew from 4,450 to 14,148 as AI coding tools act as a force multiplier.

Greptile internal data, engineering team velocity

Chart 1.3
Medium Teams Output

Medium teams (6-15 devs) increased output from 7,005 to 19,715 lines per developer.

Greptile internal data, engineering team velocity

Chart 1.4
Lines Changed Per File

Median lines changed per file grew from 18 to 25 as PRs become denser.

Greptile internal data, engineering team velocity


Section 02

AI Tool Adoption

Tracking the rise of AI-powered development tools.

Chart 2.1
AI Memory Packages

mem0 holds 58% market share. Zep dropped 9pp as smaller players gained ground.

PyPI + npm monthly downloads, Mar 2026

Chart 2.2
Vector DB Market Share

Weaviate extended its lead to 33% (+8pp). The remaining seven converged to between 5% and 11%.

PyPI + npm monthly downloads, Mar 2026

Chart 2.3
AI Rules Files

CLAUDE.md is present in 75% of orgs. AGENTS.md overtook Cursor Rules, which fell 18pp.

Repos using all three formats dropped from 17% to 6% as teams standardize on fewer formats.

Chart 2.4
AI SDK Growth

Anthropic SDK reached 124M downloads in March 2026. OpenAI Agents grew 5x in Q1 2026 to 21M.

PyPI + npm monthly downloads, Apr 2025 – Mar 2026

Chart 2.5
LLMOps Top 5

LiteLLM overtook LangSmith in Q1 2026, hitting 98M monthly downloads.

PyPI + npm monthly downloads, Jun 2025 – Mar 2026

LangSmith is bundled with LangChain installs


Section 03

Model Growth Trends

How AI models have scaled and evolved.

Chart 3.1
LLM Provider SDK Downloads

OpenAI hit 233M in Mar 2026. Anthropic surged to 83M. Google trails at 17M.

PyPI monthly downloads, Jan 2022 – Mar 2026

Chart 3.2
The Gap is Closing

OpenAI-to-Anthropic ratio dropped from 3.7:1 (Dec 2025) to 2.8:1 (Mar 2026).

Peak: 47:1 (Jan 2024)
Now: 2.8:1

PyPI monthly downloads ratio, Jul 2023 – Mar 2026



Section 04

Research & Content

Surfacing recent research that shaped how 2026 tools handle compression, context, multimodality, and long-horizon agents, so teams can interpret and apply it to their own systems.

Foundational Model Advances

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

TurboQuant is an online, data-oblivious quantization method for KV-cache compression and vector search that targets both mean-squared error and inner-product distortion.

It randomly rotates vectors so coordinates are easier to quantize, then applies scalar quantizers coordinate-wise instead of relying on slower vector quantization schemes.
To correct the inner-product bias of MSE-optimal quantization, it adds a 1-bit Quantized Johnson-Lindenstrauss step on the residual; the paper reports quality-neutral KV-cache compression at 3.5 bits per channel and stronger recall than product-quantization baselines in vector search.
Long-context inference and retrieval depend on compression schemes that preserve the geometry attention and nearest-neighbor search actually use, not just on storing vectors in fewer bits.
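
The rotate-then-scalar-quantize idea can be sketched in a few lines. This toy substitutes random sign flips (a cheap data-oblivious transform) for the paper's random rotation and applies a uniform scalar quantizer independently per coordinate; the function names, the 4-bit setting, and the [-1, 1] range are all illustrative choices, not the paper's.

```python
import random

def quantize(vec, bits, signs, lo=-1.0, hi=1.0):
    """Sign-flip each coordinate (stand-in for a random rotation),
    then apply a uniform scalar quantizer per coordinate."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    codes = []
    for x, s in zip(vec, signs):
        x = max(lo, min(hi, x * s))           # transform + clip to range
        codes.append(round((x - lo) / step))  # nearest quantization level
    return codes

def dequantize(codes, bits, signs, lo=-1.0, hi=1.0):
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    # Reconstruct each coordinate, then undo the sign flip.
    return [(lo + c * step) * s for c, s in zip(codes, signs)]

random.seed(0)
dim = 64
vec = [random.uniform(-1, 1) for _ in range(dim)]
signs = [random.choice((-1, 1)) for _ in range(dim)]  # shared transform

codes = quantize(vec, bits=4, signs=signs)
recon = dequantize(codes, bits=4, signs=signs)
mse = sum((a - b) ** 2 for a, b in zip(vec, recon)) / dim
print(f"4-bit MSE per coordinate: {mse:.5f}")
```

Because the transform is invertible and data-oblivious, encoder and decoder only need to share the seed, not the data distribution.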

Recursive Language Models

RLMs treat the prompt as external state, letting the model inspect, partition, and recursively call itself over snippets instead of feeding everything through one context window.

The model works inside a Python REPL where the prompt is stored as a variable, letting it inspect slices of the input, execute code, and recursively call itself over selected snippets.
Across CodeQA, BrowseComp-Plus, and OOLONG, RLMs can handle inputs up to two orders of magnitude beyond base context windows and outperform direct model calls and retrieval-style scaffolds.
Long-context scaling may come less from ever-larger windows and more from giving models programmable ways to search, decompose, and recurse over external state.
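
The recursion pattern can be sketched with a keyword-counting stub standing in for the model call; the chunk size, the llm() stub, and the word-boundary splitting heuristic are invented for illustration.

```python
def llm(snippet: str, query: str) -> int:
    # Stand-in for a model call over a snippet that fits in context:
    # here it just counts occurrences of the query.
    return snippet.lower().count(query.lower())

def recursive_answer(prompt: str, query: str, max_chunk: int = 100) -> int:
    if len(prompt) <= max_chunk:      # base case: snippet fits the toy window
        return llm(prompt, query)
    # Recursive case: partition the external state at a word boundary,
    # recurse over each half, and aggregate the sub-answers.
    mid = prompt.rfind(" ", 0, len(prompt) // 2) + 1
    if mid == 0:
        mid = len(prompt) // 2
    return (recursive_answer(prompt[:mid], query, max_chunk)
            + recursive_answer(prompt[mid:], query, max_chunk))

doc = ("error: timeout. " * 40) + ("ok. " * 60)   # far beyond the toy window
print(recursive_answer(doc, "timeout"))
```

The prompt never passes through llm() whole; only window-sized slices do, which is the property that lets RLMs scale past the base context limit.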

Titans: Learning to Memorize at Test Time

Titans is a family of long-context architectures that pairs limited-window attention with a neural long-term memory that keeps learning at test time.

Its memory module writes surprising inputs into its weights using gradient-based updates, then uses momentum and weight decay as a learned forgetting mechanism.
The paper instantiates the design as Memory as Context (MAC), Memory as Gating (MAG), and Memory as Layer (MAL); across language, reasoning, needle-in-a-haystack, time series, and genomics tasks, the authors report stronger results than recent recurrent and hybrid baselines, with MAC doing best on longer dependencies.
Overall, long-context modeling may benefit more from explicit write/forget/retrieve memory systems than from relying on ever-larger attention windows alone.
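
The write/forget dynamics can be illustrated with a one-weight linear memory: prediction error plays the role of surprise, momentum smooths the write, and weight decay acts as forgetting. All rates, dimensions, and the synthetic stream here are toy choices, not the paper's.

```python
def run_memory(stream, lr=0.2, beta=0.5, decay=0.01):
    w, m = 0.0, 0.0                  # memory weight, momentum buffer
    errors = []
    for x, y in stream:               # updates happen at test time
        pred = w * x
        surprise = pred - y           # gradient of 0.5*(pred-y)^2 wrt pred
        grad = surprise * x           # gradient wrt the memory weight
        m = beta * m + grad           # momentum accumulates surprise
        w = (1 - decay) * w - lr * m  # decay is the forgetting mechanism
        errors.append(abs(surprise))
    return w, errors

# A stream whose underlying rule is y = 2x: the memory should adapt
# at test time, so its surprise shrinks as w approaches 2.
xs = [1.0, -0.5, 0.8, -1.0, 0.6, 0.9, -0.7, 1.0]
stream = [(x, 2.0 * x) for x in xs] * 5
w, errors = run_memory(stream)
print(f"final w={w:.2f}, first surprise={errors[0]:.2f}, last={errors[-1]:.2f}")
```

The weight-decay term keeps the final weight slightly below the true slope, which is the intended trade: bounded memory that forgets stale content in exchange for a small steady-state bias.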

Kimi K2.5: Visual Agentic Intelligence

Kimi K2.5 is an open-source multimodal agentic model that jointly optimizes text and vision, with a parallel-agent execution framework layered on top.

It mixes vision tokens early during pretraining, uses a native-resolution MoonViT-3D encoder for images and video, and follows with zero-vision SFT plus joint multimodal RL.
Agent Swarm uses a trainable orchestrator to spawn frozen specialist subagents and run heterogeneous subtasks concurrently; the paper reports 76.8% on SWE-Bench Verified and large gains on BrowseComp with 3–4.5× lower execution time at target quality.
Agentic performance may come less from a single monolithic model doing everything in sequence and more from jointly trained multimodal foundations paired with learned orchestration over parallel specialist work.
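
The orchestration pattern, reduced to a sketch: an orchestrator fans heterogeneous subtasks out to "specialist" workers concurrently and merges the results. The specialist functions and the use of ThreadPoolExecutor are stand-ins for illustration, not the paper's framework.

```python
from concurrent.futures import ThreadPoolExecutor

def browse_specialist(url):
    # Stand-in for a frozen browsing subagent.
    return f"summary of {url}"

def code_specialist(path):
    # Stand-in for a frozen coding subagent.
    return f"review of {path}"

def orchestrate(subtasks):
    # Run heterogeneous subtasks concurrently, merge in submission order.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, arg) for fn, arg in subtasks]
        return [f.result() for f in futures]

results = orchestrate([
    (browse_specialist, "docs/page1"),
    (code_specialist, "utils.py"),
    (browse_specialist, "docs/page2"),
])
print(results)
```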

Does RAG Really Perform Bad for Long Context?

RetroLM introduces KV-level retrieval for long-context tasks, treating the KV cache as the retrieval surface instead of raw text.

Inputs are split into fixed-size KV "pages" with bookmark tokens; a trained page retriever selects important pages per layer while offloaded pages live off-device and are pulled back on demand.
Across LongBench, InfiniteBench, and RULER, RetroLM beats standard RAG pipelines and other efficient long-context methods.
The framework reframes retrieval as selecting which cached representations to keep, rather than which raw tokens to stuff into the prompt.
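
A toy version of the page-selection step, with token lists standing in for cached key/value tensors and lexical overlap standing in for the trained page retriever; PAGE_SIZE and the function names are invented.

```python
PAGE_SIZE = 4  # tokens per KV "page" (toy value)

def paginate(tokens, size=PAGE_SIZE):
    # Split the cached sequence into fixed-size pages.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def score(page, query_tokens):
    # Stand-in for the trained page retriever: lexical overlap.
    return len(set(page) & set(query_tokens))

def retrieve_pages(tokens, query_tokens, k=2):
    pages = paginate(tokens)
    ranked = sorted(range(len(pages)),
                    key=lambda i: score(pages[i], query_tokens),
                    reverse=True)
    keep = sorted(ranked[:k])   # keep top-k pages, in original order
    # Pages outside `keep` would stay offloaded until needed.
    return [tok for i in keep for tok in pages[i]]

tokens = ("the cat sat here . filler filler filler . "
          "the dog barked loudly more filler text .").split()
context = retrieve_pages(tokens, ["dog", "barked"], k=2)
print(context)
```

The point of the reframing: selection happens over cached pages rather than raw text chunks, so the unit of retrieval matches what attention actually consumes.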

Rethinking Mixture-of-Agents

Self-MoA examines whether diverse model ensembles are actually necessary for strong Mixture-of-Agents performance.

Instead of querying multiple different models, Self-MoA repeatedly samples a single strong model and aggregates its responses, trading cross-model diversity for in-model diversity.
Experiments on AlpacaEval 2.0 and other benchmarks show Self-MoA outperforming traditional MoA when proposer quality is high.
A sequential variant, Self-MoA-Seq, aggregates in sliding windows to stay within context limits while scaling the number of samples.
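
The core loop is easy to sketch: sample one (mock) model repeatedly and aggregate by majority vote. The 80%-accurate stub and vote-based aggregation are illustrative; a real system would typically aggregate with a synthesis prompt.

```python
import random

def model(question, rng):
    # Stand-in for a strong proposer: correct 80% of the time.
    return "Paris" if rng.random() < 0.8 else "Lyon"

def self_moa(question, n_samples=15, seed=0):
    rng = random.Random(seed)
    # In-model diversity: repeated samples from the same model.
    samples = [model(question, rng) for _ in range(n_samples)]
    # Aggregate; majority vote stands in for an LLM aggregator.
    return max(set(samples), key=samples.count)

print(self_moa("capital of France?"))
```

The sketch shows why proposer quality matters: with a strong single model, repeated sampling already concentrates on the right answer, and mixing in weaker models would only dilute the vote.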

Application-Layer Innovations

Chroma Context-1: Training a Self-Editing Search Agent

Context-1 is a 20B agentic search model derived from gpt-oss-20B that is designed to act as a retrieval subagent rather than answer the question directly.

It decomposes a task into subqueries, searches over multiple turns, and selectively prunes its own context to remove irrelevant documents.
Trained on more than 8,000 synthetic tasks with a curriculum that shifts from recall-heavy exploration toward higher-precision retention, it sits on the report's cost-latency frontier and improves prune accuracy from 0.824 to 0.941.
Multi-hop retrieval may work best as a specialized agent problem, where small purpose-trained models learn to plan, explore, and actively edit context instead of relying on single-pass RAG pipelines.
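
The search-then-prune loop can be sketched directly, with lexical overlap as a stand-in for the model's learned relevance judgment; the threshold and helper names are invented.

```python
def relevance(doc, task):
    # Stand-in scorer: fraction of task words that appear in the doc.
    task_words = set(task.lower().split())
    return len(task_words & set(doc.lower().split())) / len(task_words)

def search_and_prune(task, turns, threshold=0.3):
    context = []
    for retrieved in turns:         # each search turn yields candidates
        context.extend(retrieved)
        # Self-edit: drop documents that no longer look relevant,
        # instead of letting the window fill up turn after turn.
        context = [d for d in context if relevance(d, task) >= threshold]
    return context

task = "when was the golden gate bridge opened"
turns = [
    ["the golden gate bridge opened in 1937",
     "san francisco weather today"],
    ["bridge toll prices list",
     "construction of the golden gate bridge began in 1933"],
]
kept = search_and_prune(task, turns)
print(kept)
```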

Composer 2 Technical Report

Composer 2 is a domain-specialized model for agentic software engineering, built to handle long-horizon coding tasks while staying efficient enough for interactive use.

Its training recipe combines continued pretraining on a code-heavy mix and a 256k context extension with targeted coding SFT, then follows with large-scale reinforcement learning in Cursor's production tool-use harness.
The report introduces CursorBench, an internal benchmark drawn from real engineering sessions rather than curated public repos; Composer 2 scores 61.3 on CursorBench-3 and is positioned on a strong accuracy-cost frontier relative to larger general-purpose models.
Coding agents improve most when the model, training environment, reward design, and evaluation all match the actual software-engineering workflow.

GEPA: Reflective Prompt Evolution Can Outperform RL

GEPA (Genetic-Pareto) is a reflective prompt-evolution method that optimizes instructions using execution traces instead of updating model weights.

The system samples rollouts, has the model analyze its own traces in natural language, and proposes new prompts; a Pareto front keeps multiple candidates that perform well on different subsets of data.
Across four tasks, GEPA matches or beats GRPO-style RL with up to 35× fewer rollouts.
The work treats prompts as an external optimization layer, using natural-language reflection rather than heavyweight RL.
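
The Pareto-front bookkeeping can be shown directly. The candidate prompts and their per-subset scores below are made up, and dominance is the standard test: at least as good on every subset, strictly better on at least one.

```python
def dominates(a, b):
    # a dominates b if it is >= everywhere and > somewhere.
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    # candidates: {prompt: (score_subset1, score_subset2, ...)}
    # A candidate survives if no other candidate dominates it.
    return {p: s for p, s in candidates.items()
            if not any(dominates(o, s)
                       for q, o in candidates.items() if q != p)}

candidates = {
    "be concise":         (0.9, 0.4),
    "cite sources":       (0.5, 0.8),
    "think step by step": (0.7, 0.7),
    "just answer":        (0.4, 0.3),  # dominated by every other prompt
}
front = pareto_front(candidates)
print(sorted(front))
```

Keeping the whole front, rather than a single best prompt, is what lets reflection propose edits from candidates that each excel on different slices of the data.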

SFR-DeepResearch: Single-Agent RL for Deep Web Research

SFR-DeepResearch (SFR-DR) is a reinforcement-learning framework for training a single web-research agent that decides when to search, browse, or execute code.

The agent uses three minimal tools—search_internet, browse_page, stateless code_interpreter—plus a self-managed memory tool (clean_memory) that lets it control long-horizon context instead of passively appending everything.
Length-normalized RL stabilizes multi-step optimization and prevents degenerate, repetitive tool use.
Results on Humanity's Last Exam and related benchmarks highlight that context management and planning are the core bottlenecks, not just model size.
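
A sketch of the tool loop. The tool names follow the paper, but their bodies and the hard-coded dispatch are stand-ins for the RL-trained policy.

```python
def search_internet(query):
    # Stand-in for a web search call.
    return f"results for {query!r}"

def code_interpreter(expr):
    # Stateless evaluation of an arithmetic expression.
    return str(eval(expr, {"__builtins__": {}}, {}))

def clean_memory(memory, keep_last=2):
    # Agent-controlled forgetting: rewrite the scratchpad
    # instead of appending every observation forever.
    return memory[-keep_last:]

memory = []
steps = [
    (search_internet, "population of tokyo"),
    (code_interpreter, "37_400_000 / 1_000_000"),
    (search_internet, "population of delhi"),
]
for tool, arg in steps:
    memory.append(f"{tool.__name__}: {tool(arg)}")
    if len(memory) > 2:          # long-horizon: prune proactively
        memory = clean_memory(memory)

print(memory)
```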

MEM1: Constant-Memory Long-Horizon Agents

MEM1 is an RL framework that trains LLM agents to operate over long multi-turn tasks while keeping memory usage nearly constant.

At each step, previous memory and new observations are merged into a compact internal state token (IS) and older context is discarded; a masked-trajectory RL scheme reconstructs valid trajectories for PPO without feeding the entire history.
MEM1-7B matches or beats much larger baselines on tasks with up to 16 sequential objectives while reducing memory use by ~3.7×.
Long-horizon behavior can come from learned internal state handling rather than expanding context windows or bolting on external memory.
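
The constant-size state update, reduced to a toy: each turn merges the previous state and the new observation into at most STATE_SIZE facts and discards the rest. The keyword-based merge rule is an invented stand-in for the learned internal-state token.

```python
STATE_SIZE = 3  # max facts carried between turns (toy value)

def merge(state, observation, goal_words):
    # Fold the new observation into the compact state, then keep
    # only goal-relevant facts, most recent first; raw history
    # outside the state is discarded.
    facts = state + [observation]
    relevant = [f for f in facts if any(w in f for w in goal_words)]
    return relevant[-STATE_SIZE:]

goal = ("key", "door")
observations = [
    "a long hallway", "a red key on the table", "a locked blue door",
    "some dust", "a brass key under the rug", "an open window",
]
state = []
for obs in observations:          # context stays flat as turns accumulate
    state = merge(state, obs, goal)
print(state, len(state))
```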

Search-R1: Training LLMs to Reason and Search with RL

Search-R1 trains models to interleave step-by-step reasoning with live search-engine queries.

The framework uses a structured template—think, search, information, answer—where PPO or GRPO updates apply only to model-generated segments, treating the search engine as part of the environment.
Evaluated across seven QA datasets, Search-R1 delivers large gains over strong RAG baselines, including on multi-hop tasks like HotpotQA and 2WikiMultiHopQA.
The paper positions targeted, RL-trained search behavior as an alternative to static top-k retrieval and hand-crafted tool chains.
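
The template loop as a toy: a dict plays the search engine and a hard-coded policy emits the four segments. In the real system the model generates these spans itself, and only the model-generated spans receive policy-gradient updates; the retrieved information segment is treated as environment output.

```python
CORPUS = {
    "capital of france": "Paris is the capital of France.",
    "population of paris": "Paris has about 2.1 million residents.",
}

def search(query):
    # The search engine is part of the environment, not the policy.
    return CORPUS.get(query.lower(), "no results")

def run_episode(question):
    trace = [f"<think>I should look this up: {question}</think>"]
    query = question.rstrip("?").lower()
    trace.append(f"<search>{query}</search>")
    info = search(query)
    trace.append(f"<information>{info}</information>")  # env segment
    # Toy extraction standing in for the model's final reasoning step.
    answer = info.split(" is ")[0] if " is " in info else info
    trace.append(f"<answer>{answer}</answer>")
    return trace, answer

trace, answer = run_episode("capital of France?")
print(answer)
```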
Greptile

Automatically review PRs with your team's standards