
vLLM: LLM inference and serving engine

Fast and memory-efficient inference engine for large language models with PagedAttention optimization for production deployments at scale.

Rankings: #10 overall · #4 in AI & ML
Stars: 67.2K · Forks: 12.5K · 7-day stars: +216 · 7-day forks: +67

Learn more about vllm

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It leverages PagedAttention, an attention algorithm that manages the key-value cache dynamically, reducing memory waste and enabling larger batch sizes. The engine supports continuous batching for improved throughput, model parallelism for distributed inference, and streaming outputs. Compatible with HuggingFace models, vLLM integrates with popular transformer architectures such as GPT, LLaMA, and Mistral. It also provides OpenAI-compatible API endpoints, making it well suited to production deployments that require low-latency inference, high throughput, and efficient GPU utilization when serving LLMs at scale.


1. PagedAttention Memory Optimization

The PagedAttention algorithm treats the attention key-value cache like virtual memory in an operating system, eliminating memory fragmentation and reducing waste. By dynamically managing fixed-size memory blocks, it enables up to 24x higher throughput than traditional implementations, allowing more concurrent requests and larger batch sizes without running out of GPU memory.
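For intuition, here is a rough sketch of the paged KV-cache idea: fixed-size blocks are handed out on demand from a shared pool and tracked per sequence through a block table. The class names and 16-token block size below are illustrative assumptions, not vLLM internals.

BLOCK_SIZE = 16  # tokens stored per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size cache blocks from a shared GPU-wide pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; preempt or queue the request")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Maps a sequence's logical cache slots to physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the current one fills up, so
        # at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):     # 40 generated tokens occupy ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)  # e.g. [1023, 1022, 1021]

Because blocks are allocated lazily and returned to the pool when a sequence finishes, memory is wasted only in the partially filled last block of each sequence rather than in large pre-reserved contiguous buffers.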

2. Continuous Batching Throughput

Continuous batching dynamically adds and removes requests from the running batch as individual sequences finish generating, maximizing GPU utilization. Unlike static batching, which waits for every sequence in a batch to finish, vLLM's approach keeps the GPU constantly busy, significantly improving throughput for variable-length generation and reducing average latency for production workloads.
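The toy loop below illustrates the iteration-level scheduling idea. It is a simplified model, not vLLM's scheduler; the Request class and its fake decode step are stand-ins for real engine components.

from collections import deque
import random

class Request:
    """Stand-in for a generation request; counts decoded tokens."""
    def __init__(self, rid: int, target_len: int):
        self.rid = rid
        self.target_len = target_len
        self.generated = 0

    def decode_one_token(self) -> bool:
        # Pretend to decode one token; report whether the sequence is done.
        self.generated += 1
        return self.generated >= self.target_len

def continuous_batching(waiting: deque, max_batch: int = 4) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new requests whenever the batch has free slots, instead of
        # waiting for the whole batch to drain as static batching would.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One decode step across every in-flight request.
        finished = [r for r in running if r.decode_one_token()]

        # Retire finished sequences immediately so their slots are reused
        # on the very next iteration.
        for r in finished:
            print(f"request {r.rid} finished after {r.generated} tokens")
        running = [r for r in running if r not in finished]

continuous_batching(deque(Request(i, random.randint(3, 10)) for i in range(8)))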

3. OpenAI-Compatible API Server

The built-in API server provides a drop-in replacement for OpenAI's API endpoints, enabling seamless migration from hosted services to self-hosted infrastructure. It supports streaming responses, chat completions, and standard sampling parameters, so developers can switch providers without changing client code while retaining full control over model deployment and data privacy.
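As a sketch of the migration path, assuming a server launched with something like `vllm serve facebook/opt-125m` and listening on the default port 8000, the stock openai Python client can be pointed at it by overriding the base URL. The model name and prompt are placeholders; the offline in-process API is shown in the snippet that follows.

from openai import OpenAI

# Point the client at the local server instead of api.openai.com; the key is
# a placeholder unless the server was started with an API key requirement.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="facebook/opt-125m",   # must match the model the server loaded
    prompt="The future of AI is",
    max_tokens=32,
    temperature=0.8,
)
print(response.choices[0].text)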


# Offline batched inference with vLLM's LLM class.
from vllm import LLM, SamplingParams

# Load a small HuggingFace model; weights are downloaded on first use.
llm = LLM(model="facebook/opt-125m")

# Prompts are processed together as one batch.
prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Each result carries the prompt plus one or more generated completions.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)


v0.11.0

V0 engine completely removed; V1 is now the only engine. C++17 build requirement enforced; FULL_AND_PIECEWISE CUDA graph mode now default.

  • Migrate all code from V0 engine (AsyncLLMEngine, LLMEngine, attention backends) as they are fully removed.
  • Avoid --async-scheduling flag as it produces gibberish output in preemption cases; fixed in next release.
v0.10.2

Breaking release upgrades PyTorch to 2.8.0, deprecates V0 APIs, and disables FlashMLA on Blackwell GPUs due to compatibility issues.

  • Review V0 deprecations and API changes before upgrading; PyTorch 2.8.0 is now required.
  • Use `--safetensors-load-strategy` for NFS acceleration and note FlashMLA is disabled on Blackwell.
v0.10.1.1

Critical security release patching two vulnerabilities: HTTP header DoS and arbitrary code execution via eval().

  • Upgrade immediately to fix CVE-level flaws allowing HTTP header exhaustion and remote code execution through type conversion.
  • Patch also resolves CUTLASS MLA Full CUDAGraph crash; no configuration changes required.


