
vLLM: LLM inference and serving engine

Fast, memory-efficient LLM inference engine with PagedAttention for production deployments at scale.

LIVE RANKINGS (12:30 PM · steady)
  • Overall rank: #18 (Δ4)
  • AI & ML rank: #14 (Δ3)
  • Stars: 71.3K (+897 in the last 7 days)
  • Forks: 13.7K (+258 in the last 7 days)
  • 30-day ranking trend: overall #18, AI & ML #14

Learn more about vLLM

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It leverages PagedAttention, an innovative attention algorithm that manages key-value cache dynamically, reducing memory waste and enabling higher batch sizes. The engine supports continuous batching for improved throughput, model parallelism for distributed inference, and streaming outputs. Compatible with HuggingFace models, vLLM integrates seamlessly with popular transformer architectures including GPT, LLaMA, Mistral, and more. It provides OpenAI-compatible API endpoints, making it ideal for production deployments requiring low-latency inference, high throughput, and efficient GPU utilization for serving LLMs at scale.


1. PagedAttention Memory Optimization

Revolutionary PagedAttention algorithm treats attention key-value cache like virtual memory in operating systems, eliminating memory fragmentation and reducing waste. This innovation enables up to 24x higher throughput compared to traditional implementations by dynamically managing memory blocks, allowing more concurrent requests and larger batch sizes without running out of GPU memory.
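The core idea can be illustrated with a toy allocator: the KV cache is carved into fixed-size blocks (vLLM's default is 16 tokens per block), sequences claim only as many blocks as their token count requires, and finished sequences return their blocks to a free pool for immediate reuse. This is a minimal sketch of that bookkeeping, not vLLM's actual implementation — the `BlockAllocator` class and its methods are hypothetical names for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks, reusing freed ones first."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def blocks_needed(self, num_tokens):
        # Ceiling division: a sequence of N tokens occupies ceil(N / 16) blocks,
        # so at most one partially filled block is wasted per sequence.
        return -(-num_tokens // BLOCK_SIZE)

    def allocate(self, num_tokens):
        n = self.blocks_needed(num_tokens)
        if n > len(self.free_blocks):
            raise MemoryError("out of KV-cache blocks")
        return [self.free_blocks.pop() for _ in range(n)]

    def free(self, blocks):
        # Returned blocks go straight back to the pool; they need not be
        # contiguous, which is what eliminates fragmentation.
        self.free_blocks.extend(blocks)

alloc = BlockAllocator(num_blocks=8)
seq_a = alloc.allocate(40)   # 40 tokens -> 3 blocks
seq_b = alloc.allocate(10)   # 10 tokens -> 1 block
alloc.free(seq_a)            # a finished sequence frees its blocks
seq_c = alloc.allocate(100)  # 100 tokens -> 7 blocks, reusing freed ones
print(len(seq_c))            # prints: 7
```

Because blocks are small and non-contiguous, memory that a finished request held is available to new requests on the very next step, which is what allows larger effective batch sizes on the same GPU.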

2. Continuous Batching Throughput

Advanced continuous batching dynamically adds and removes requests from batches as they complete generation, maximizing GPU utilization. Unlike static batching which waits for all sequences to finish, vLLM's approach keeps the GPU constantly busy, significantly improving throughput for variable-length generation tasks and reducing average latency for production workloads.
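The throughput gap is easy to see in a toy simulation. Assuming one generated token per scheduler step and an illustrative batch capacity of two, the simulation below compares a continuous scheduler (which admits a waiting request the moment a slot frees up) against a static one (where each batch waits for its longest sequence). The function names and numbers are invented for illustration; they are not vLLM APIs.

```python
from collections import deque

def continuous_batching(gen_lengths, max_batch=2):
    """Steps to finish all requests when slots are refilled immediately."""
    waiting = deque(gen_lengths)
    active = []
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())  # admit new requests right away
        active = [t - 1 for t in active]      # one decode step per active seq
        active = [t for t in active if t > 0] # finished seqs leave the batch
        steps += 1
    return steps

def static_batching(gen_lengths, max_batch=2):
    """Steps when each batch must fully drain before the next one starts."""
    steps = 0
    for i in range(0, len(gen_lengths), max_batch):
        steps += max(gen_lengths[i:i + max_batch])  # wait for the longest
    return steps

jobs = [8, 2, 2, 2]  # target generation lengths of four requests
print(continuous_batching(jobs), static_batching(jobs))  # prints: 8 10
```

Even in this tiny example the static scheduler idles a slot while the 8-token sequence finishes; with real variable-length workloads and large batches, that idle time is where the throughput gains come from.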

3. OpenAI-Compatible API Server

Built-in API server provides drop-in replacement for OpenAI's API endpoints, enabling seamless migration from hosted services to self-hosted infrastructure. Supports streaming responses, chat completions, and standard parameters, allowing developers to switch providers without changing client code while maintaining full control over model deployment and data privacy.
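Compatibility means the request body has the same shape as OpenAI's chat-completions API, so existing clients only need a new base URL. The snippet below builds such a request body; the host and port (`http://localhost:8000/v1/chat/completions`) are the usual defaults for a locally started vLLM server, but treat them as assumptions about your deployment.

```python
import json

# An OpenAI-style chat-completions request body, as accepted by an
# OpenAI-compatible server such as vLLM's. You would POST this (with an
# HTTP client of your choice) to, e.g.:
#   http://localhost:8000/v1/chat/completions
payload = {
    "model": "facebook/opt-125m",
    "messages": [
        {"role": "user", "content": "Summarize PagedAttention in one sentence."}
    ],
    "temperature": 0.8,
    "stream": False,  # set True to stream tokens as server-sent events
}
body = json.dumps(payload)
print(body[:17])  # prints: {"model": "faceb
```

Because the schema matches, the official OpenAI client libraries can also be pointed at a self-hosted server by changing their base URL, leaving application code untouched.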


from vllm import LLM, SamplingParams

# Load a model (weights are fetched from HuggingFace on first use)
llm = LLM(model="facebook/opt-125m")

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate completions for all prompts in a single batched call
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)


v0.15.1

Patch release with security fixes, RTX Blackwell GPU support, and bug fixes.

  • Updated aiohttp dependency for security fix
  • Updated Protobuf dependency for security fix
  • Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs
  • Fixed FP8 CUTLASS group GEMM to properly fall back to Triton kernels on SM120 GPUs
  • Added Step-3.5-Flash model support
v0.15.0

Major release with new model architectures, LoRA expansion, and enhanced speculative decoding.

  • New architectures: Kimi-K2.5, Molmo2, Step3vl 10B, Step1, GLM-Lite, Eagle2.5-8B VLM
  • LoRA expansion: Nemotron-H, InternVL2, MiniMax M2
  • Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration, Qwen3 VL MoE, draft model support
  • Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings
  • Model enhancements: Voxtral streaming, SharedFusedMoE for Qwen3MoE, dynamic resolution for Nemotron Nano VL
v0.14.1

Patch release addressing security vulnerabilities and memory leak fixes.

  • Security and memory leak fixes


