vLLM: LLM inference and serving engine
A fast, memory-efficient inference and serving engine for large language models, built around PagedAttention for production deployments at scale.
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It is built on PagedAttention, an attention algorithm that manages the key-value (KV) cache dynamically, reducing memory waste and enabling larger batch sizes. The engine supports continuous batching for improved throughput, model parallelism for distributed inference, and streaming outputs. Compatible with HuggingFace models, vLLM integrates seamlessly with popular transformer architectures including GPT, LLaMA, and Mistral. It provides OpenAI-compatible API endpoints, making it well suited for production deployments that need low-latency inference, high throughput, and efficient GPU utilization when serving LLMs at scale.
PagedAttention Memory Optimization
The PagedAttention algorithm treats the attention key-value cache like virtual memory in an operating system, eliminating memory fragmentation and reducing waste. By managing memory in fixed-size blocks allocated on demand, it enables up to 24x higher throughput than traditional implementations, fitting more concurrent requests and larger batch sizes without running out of GPU memory.
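A short sketch of the memory knobs this exposes through the Python API; the values below are illustrative, and exact argument names can vary between releases:

from vllm import LLM, SamplingParams

# Illustrative settings; argument names reflect recent vLLM releases.
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.90,  # fraction of GPU memory reserved for weights + KV-cache blocks
    block_size=16,                # tokens per KV-cache block (the "page" size)
    max_num_seqs=256,             # cap on concurrently running sequences
)

# Blocks are handed out on demand rather than pre-allocated per sequence,
# which is what lets larger batches fit in the same GPU memory.
outputs = llm.generate(["Paged KV cache demo"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)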
Continuous Batching Throughput
Continuous batching adds incoming requests to the running batch and retires sequences as soon as they finish generating, keeping the GPU busy at all times. Unlike static batching, which waits for every sequence in a batch to finish, this approach significantly improves throughput for variable-length generation tasks and reduces average latency for production workloads.
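A minimal, illustrative sketch of the scheduling idea (not vLLM's actual scheduler; `step_fn` is a hypothetical decode step):

from collections import deque

def continuous_batching(requests, max_batch_size, step_fn):
    waiting = deque(requests)   # requests not yet admitted
    running = []                # requests currently generating
    finished = []
    while waiting or running:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step over the running batch; step_fn returns
        # (completed requests, still-running requests).
        done, running = step_fn(running)
        finished.extend(done)   # finished sequences leave immediately, freeing slots
    return finished

vLLM's real scheduler also budgets KV-cache blocks and tokens per step, but this admit-step-retire loop is the core of why the GPU stays busy.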
OpenAI-Compatible API Server
The built-in API server is a drop-in replacement for OpenAI's API endpoints, enabling seamless migration from hosted services to self-hosted infrastructure. It supports streaming responses, chat completions, and the standard request parameters, so developers can switch providers without changing client code while keeping full control over model deployment and data privacy.
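A minimal client-side sketch, assuming the server has been started with something like `vllm serve facebook/opt-125m` and is listening on the default port 8000 (adjust the address, model name, and API key for your deployment):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default listen address
    api_key="EMPTY",                      # placeholder; only checked if the server sets --api-key
)

completion = client.completions.create(
    model="facebook/opt-125m",            # must match the served model name
    prompt="The future of AI is",
    max_tokens=32,
    temperature=0.8,
)
print(completion.choices[0].text)

For offline batch inference without a server, the same models can be driven directly from the Python API: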
from vllm import LLM, SamplingParams

# Load the model once; vLLM allocates the paged KV cache up front.
llm = LLM(model="facebook/opt-125m")

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() batches the prompts together and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

The V0 engine has been completely removed; V1 is now the only engine. A C++17 build requirement is enforced, and FULL_AND_PIECEWISE CUDA graph mode is now the default.
- Migrate any code that still uses the V0 engine (AsyncLLMEngine, LLMEngine, attention backends), as it has been fully removed.
- Avoid the `--async-scheduling` flag, as it produces gibberish output in preemption cases; a fix lands in the next release.
Breaking release upgrades PyTorch to 2.8.0, deprecates V0 APIs, and disables FlashMLA on Blackwell GPUs due to compatibility issues.
- Review V0 deprecations and API changes before upgrading; PyTorch 2.8.0 is now required.
- Use `--safetensors-load-strategy` for NFS acceleration and note FlashMLA is disabled on Blackwell.
Critical security release patching two vulnerabilities: HTTP header DoS and arbitrary code execution via eval().
- Upgrade immediately to fix CVE-level flaws allowing HTTP header exhaustion and remote code execution through type conversion.
- Patch also resolves CUTLASS MLA Full CUDAGraph crash; no configuration changes required.
Related Repositories
Discover similar tools and frameworks used by developers
unsloth
Memory-efficient Python library for accelerated LLM training.
segment-anything
Transformer-based promptable segmentation with zero-shot generalization.
openvino
Convert and deploy deep learning models across Intel hardware.
ControlNet
Dual-branch architecture for conditional diffusion model control.
yolov5
Real-time object detection with cross-platform deployment support.