vLLM: LLM inference and serving engine
Fast, memory-efficient LLM inference engine with PagedAttention for production deployments at scale.
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It leverages PagedAttention, an attention algorithm that manages the key-value cache dynamically, reducing memory waste and enabling larger batch sizes. The engine supports continuous batching for improved throughput, model parallelism for distributed inference, and streaming outputs. Compatible with Hugging Face models, vLLM integrates seamlessly with popular transformer architectures including GPT, LLaMA, and Mistral. It provides OpenAI-compatible API endpoints, making it well suited for production deployments that require low-latency inference, high throughput, and efficient GPU utilization when serving LLMs at scale.
PagedAttention Memory Optimization
Revolutionary PagedAttention algorithm treats attention key-value cache like virtual memory in operating systems, eliminating memory fragmentation and reducing waste. This innovation enables up to 24x higher throughput compared to traditional implementations by dynamically managing memory blocks, allowing more concurrent requests and larger batch sizes without running out of GPU memory.
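The block-table idea behind PagedAttention can be sketched in plain Python. This is a toy model, not vLLM's real data structures: the 16-token block size, class names, and methods below are all illustrative.

```python
# Toy sketch of PagedAttention-style KV-cache paging (illustrative only).
# Each sequence maps its growing token stream onto fixed-size physical
# blocks drawn from a shared free pool, like pages in virtual memory.
BLOCK_SIZE = 16  # tokens per block (assumed value)

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def free_blocks(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a growing token sequence onto non-contiguous physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the current one is full,
        # so unused slots are bounded by one partially filled block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def release(self):
        # Finished sequences return their blocks to the pool immediately,
        # so other requests can reuse them.
        self.allocator.free_blocks(self.block_table)
        self.block_table = []

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(20):              # 20 tokens -> ceil(20/16) = 2 blocks
    seq.append_token()
print(len(seq.block_table))      # 2
seq.release()
print(len(alloc.free))           # 64
```

The key point is that memory is committed block by block as a sequence grows, rather than reserved up front for the maximum possible length, which is what eliminates most fragmentation and waste.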
Continuous Batching Throughput
Advanced continuous batching dynamically adds and removes requests from batches as they complete generation, maximizing GPU utilization. Unlike static batching which waits for all sequences to finish, vLLM's approach keeps the GPU constantly busy, significantly improving throughput for variable-length generation tasks and reducing average latency for production workloads.
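The difference between the two batching strategies can be sketched with a toy scheduler. This is a simplified model (each request "generates" for a known number of decode steps; the `max_batch` value is illustrative), not vLLM's actual scheduler.

```python
# Toy comparison of continuous vs. static batching (illustrative only).
from collections import deque

def continuous_batching(request_lengths, max_batch=4):
    """Each step decodes one token for every running request; finished
    requests leave the batch immediately and waiting ones take their slot."""
    waiting = deque(enumerate(request_lengths))
    running = {}          # request id -> remaining decode steps
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # refill freed slots
            rid, length = waiting.popleft()
            running[rid] = length
        steps += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]                      # slot freed mid-run
    return steps

def static_batching(request_lengths, max_batch=4):
    """The whole batch waits for its longest request before the next starts."""
    steps = 0
    for i in range(0, len(request_lengths), max_batch):
        steps += max(request_lengths[i:i + max_batch])
    return steps

lengths = [2, 8, 3, 8, 1, 1, 1, 1]
print(continuous_batching(lengths))  # 8
print(static_batching(lengths))      # 9
```

Even in this tiny example the continuous scheduler finishes sooner because short requests vacate their slots immediately; with realistic, highly variable generation lengths the gap in GPU utilization grows much larger.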
OpenAI-Compatible API Server
Built-in API server provides drop-in replacement for OpenAI's API endpoints, enabling seamless migration from hosted services to self-hosted infrastructure. Supports streaming responses, chat completions, and standard parameters, allowing developers to switch providers without changing client code while maintaining full control over model deployment and data privacy.
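As a sketch of that compatibility, the request below builds an OpenAI-style chat-completions payload using only the standard library. It assumes a vLLM server started with `vllm serve facebook/opt-125m` and listening on localhost:8000; the actual send is commented out so the snippet stands alone.

```python
# Sketch of a request to a vLLM OpenAI-compatible endpoint (assumes a
# local server; model name and port are illustrative).
import json
from urllib import request

payload = {
    "model": "facebook/opt-125m",   # model name as served by vLLM
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.8,
    "stream": False,                # set True for streamed tokens
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With a running server:
# body = json.loads(request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
print(req.get_method())  # POST (urllib infers it from the data field)
```

Because the request shape matches OpenAI's API, an existing client built on the `openai` SDK can typically be pointed at a vLLM server by changing only its base URL.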
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Patch release with security fixes, RTX Blackwell GPU support, and bug fixes.
- Updated aiohttp dependency for security fix
- Updated Protobuf dependency for security fix
- Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs
- Fixed FP8 CUTLASS group GEMM to properly fall back to Triton kernels on SM120 GPUs
- Added Step-3.5-Flash model support
Major release with new model architectures, LoRA expansion, and enhanced speculative decoding.
- New architectures: Kimi-K2.5, Molmo2, Step3vl 10B, Step1, GLM-Lite, Eagle2.5-8B VLM
- LoRA expansion: Nemotron-H, InternVL2, MiniMax M2
- Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration, Qwen3 VL MoE, draft model support
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings
- Model enhancements: Voxtral streaming, SharedFusedMoE for Qwen3MoE, dynamic resolution for Nemotron Nano VL
Patch release addressing security vulnerabilities and memory leak fixes.
- Security and memory leak fixes
See how people are using vLLM
Related Repositories
Discover similar tools and frameworks used by developers
Goose
LLM-powered agent automating local software engineering workflows.
Pi Mono
Monorepo providing AI agent development tools, unified LLM API, and deployment management for multiple providers.
Chroma
Vector database for embedding storage and semantic search.
Evo 2
Foundation model for DNA sequence generation and scoring.
PyTorch
Python framework for differentiable tensor computation and deep learning.