vLLM: LLM inference and serving engine
A fast, memory-efficient inference and serving engine for large language models, built around PagedAttention for production deployments at scale.
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It is built on PagedAttention, an attention algorithm that manages the key-value (KV) cache dynamically, reducing memory waste and enabling larger batch sizes. The engine supports continuous batching for improved throughput, model parallelism for distributed inference, and streaming outputs. Compatible with HuggingFace models, vLLM integrates seamlessly with popular transformer architectures including GPT, LLaMA, and Mistral. It provides OpenAI-compatible API endpoints, making it well suited for production deployments that need low-latency inference, high throughput, and efficient GPU utilization when serving LLMs at scale.
PagedAttention Memory Optimization
The PagedAttention algorithm treats the attention key-value cache like virtual memory in an operating system, eliminating memory fragmentation and reducing waste. By managing memory in fixed-size blocks allocated on demand, it enables up to 24x higher throughput than traditional implementations, fitting more concurrent requests and larger batch sizes without running out of GPU memory.
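A short sketch of the memory knobs this exposes through the Python API; the values below are illustrative, and exact argument names can vary between releases:

from vllm import LLM, SamplingParams

# Illustrative settings; argument names reflect recent vLLM releases.
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.90,  # fraction of GPU memory reserved for weights + KV-cache blocks
    block_size=16,                # tokens per KV-cache block (the "page" size)
    max_num_seqs=256,             # cap on concurrently running sequences
)

# Blocks are handed out on demand rather than pre-allocated per sequence,
# which is what lets larger batches fit in the same GPU memory.
outputs = llm.generate(["Paged KV cache demo"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)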
Continuous Batching Throughput
Continuous batching adds incoming requests to the running batch and retires sequences as soon as they finish generating, keeping the GPU busy at all times. Unlike static batching, which waits for every sequence in a batch to finish, this approach significantly improves throughput for variable-length generation tasks and reduces average latency for production workloads.
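A minimal, illustrative sketch of the scheduling idea (not vLLM's actual scheduler; `step_fn` is a hypothetical decode step):

from collections import deque

def continuous_batching(requests, max_batch_size, step_fn):
    waiting = deque(requests)   # requests not yet admitted
    running = []                # requests currently generating
    finished = []
    while waiting or running:
        # Admit new requests as soon as slots free up, instead of waiting
        # for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step over the running batch; step_fn returns
        # (completed requests, still-running requests).
        done, running = step_fn(running)
        finished.extend(done)   # finished sequences leave immediately, freeing slots
    return finished

vLLM's real scheduler also budgets KV-cache blocks and tokens per step, but this admit-step-retire loop is the core of why the GPU stays busy.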
OpenAI-Compatible API Server
The built-in API server is a drop-in replacement for OpenAI's API endpoints, enabling seamless migration from hosted services to self-hosted infrastructure. It supports streaming responses, chat completions, and the standard request parameters, so developers can switch providers without changing client code while keeping full control over model deployment and data privacy.
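A minimal client-side sketch, assuming the server has been started with something like `vllm serve facebook/opt-125m` and is listening on the default port 8000 (adjust the address, model name, and API key for your deployment):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default listen address
    api_key="EMPTY",                      # placeholder; only checked if the server sets --api-key
)

completion = client.completions.create(
    model="facebook/opt-125m",            # must match the served model name
    prompt="The future of AI is",
    max_tokens=32,
    temperature=0.8,
)
print(completion.choices[0].text)

For offline batch inference without a server, the same models can be driven directly from the Python API: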
from vllm import LLM, SamplingParams

# Load the model once; vLLM allocates the paged KV cache up front.
llm = LLM(model="facebook/opt-125m")

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() batches the prompts together and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

The V0 engine has been completely removed; V1 is now the only engine. A C++17 build requirement is enforced, and FULL_AND_PIECEWISE CUDA graph mode is now the default.
- Migrate any code that still uses the V0 engine (AsyncLLMEngine, LLMEngine, attention backends), as it has been fully removed.
- Avoid the `--async-scheduling` flag, as it produces gibberish output in preemption cases; a fix lands in the next release.
Breaking release upgrades PyTorch to 2.8.0, deprecates V0 APIs, and disables FlashMLA on Blackwell GPUs due to compatibility issues.
- Review V0 deprecations and API changes before upgrading; PyTorch 2.8.0 is now required.
- Use `--safetensors-load-strategy` for NFS acceleration and note FlashMLA is disabled on Blackwell.
Critical security release patching two vulnerabilities: HTTP header DoS and arbitrary code execution via eval().
- Upgrade immediately to fix CVE-level flaws allowing HTTP header exhaustion and remote code execution through type conversion.
- Patch also resolves CUTLASS MLA Full CUDAGraph crash; no configuration changes required.
Related Repositories
Discover similar tools and frameworks used by developers
unsloth
Memory-efficient Python library for accelerated LLM training.
segment-anything
Transformer-based promptable segmentation with zero-shot generalization.
openvino
Convert and deploy deep learning models across Intel hardware.
ControlNet
Dual-branch architecture for conditional diffusion model control.
yolov5
Real-time object detection with cross-platform deployment support.