SGLang: Serving framework for large language models
High-performance inference engine for LLMs and VLMs.
SGLang is a serving framework written in Python and CUDA that handles inference for large language models and vision-language models. The framework implements scheduling, batching, and memory-management optimizations to improve throughput and latency during model serving. It supports multiple hardware backends, including NVIDIA GPUs, AMD GPUs, and TPUs, through different implementations such as the SGLang-Jax backend. Common deployment scenarios include running open-source models like Llama, Qwen, and DeepSeek, as well as proprietary models through API compatibility layers.
Multi-Backend Hardware Support
Runs natively on NVIDIA GPUs, AMD GPUs, and TPUs through specialized backends: the CUDA and ROCm implementations target GPU hardware, while SGLang-Jax enables TPU execution, all without changes to serving code.
Cache-Aware Batch Scheduling
A zero-overhead scheduler and a cache-aware load balancer optimize memory usage across concurrent requests, reducing scheduling latency and maximizing throughput compared to naive batching approaches.
Day-One Model Support
Provides immediate integration with newly released language models through a flexible architecture that adapts to evolving model APIs. New model releases can be deployed within hours rather than weeks, ensuring users always have access to the latest AI capabilities without waiting for framework updates.
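The OpenAI-compatible client call below assumes an SGLang server is already serving the model locally. A typical launch looks like the following (the model path and port here are illustrative and match the client snippet):

```shell
# Start an SGLang server exposing an OpenAI-compatible API on port 30000
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```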
from openai import OpenAI

# Point the standard OpenAI client at a locally running SGLang server.
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)

Major performance improvements with 1.5x faster diffusion models and linear scaling for million-token contexts
- Up to 1.5x faster across the board for all major diffusion models
- Close to linear scaling with chunked pipeline parallelism for super-long million-token contexts
- Optimizing GLM4-MoE for Production: 65% Faster TTFT
- EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models
- Day 0 Support for GLM 4.7 Flash
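The chunked pipeline parallelism mentioned above rests on a simple idea: instead of prefilling a million-token prompt in one pass, the prompt is split into fixed-size chunks so per-step memory stays bounded and pipeline stages can overlap. A toy sketch of the chunking step (not SGLang's implementation; `process_chunk` is a hypothetical stand-in for one stage's forward pass):

```python
def chunked_prefill(tokens, chunk_size, process_chunk):
    """Process a long token sequence in fixed-size chunks.

    Each chunk is handed to `process_chunk` independently, which is what
    bounds per-step memory and lets pipeline stages overlap work on
    very long contexts.
    """
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        outputs.append(process_chunk(tokens[start:start + chunk_size]))
    return outputs

# Using `len` as a stand-in stage shows the chunk boundaries:
print(chunked_prefill(list(range(10)), 4, len))  # → [4, 4, 2]
```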
Massive performance improvements with 10-12x faster cache-aware routing using radix trees
- Cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900)
- Prefix matching across 10,000 tree entries jumped from 41,000 to 124,000 operations per second
- Under concurrent load with 64 threads, the system processes 474,000 operations per second
- INSERT operations now process 440 MB/s (up from 38 MB/s)
- MATCH operations handle 253 MB/s (up from 83 MB/s)
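The numbers above describe a radix-tree index over cached token prefixes: each incoming request is matched against previously seen prompts so the router can send it to a worker that already holds the shared prefix's KV cache. A minimal sketch of the idea, using a plain trie rather than a compressed radix tree for brevity (class and method names here are hypothetical, not the router's API):

```python
class TrieNode:
    def __init__(self):
        self.children = {}


class PrefixCacheIndex:
    """Toy prefix index over token IDs. A real radix tree compresses
    runs of single-child nodes into one edge, but insert/match logic
    is conceptually the same."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Record a served request's token sequence."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched


index = PrefixCacheIndex()
index.insert([1, 2, 3, 4])               # tokens from an earlier request
print(index.match_prefix([1, 2, 3, 9]))  # → 3: three tokens of KV cache reusable
```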
Day 0 support for multiple new models including Mimo-V2-Flash, Nemotron-Nano-v3, and LLaDA 2.0
- Day 0 Support for Mimo-V2-Flash
- Day 0 Support for Nemotron-Nano-v3
- Day 0 Support for LLaDA 2.0
- SGLang-Diffusion Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered
Related Repositories
Discover similar tools and frameworks used by developers
whisper.cpp
Lightweight C++ port of OpenAI Whisper for cross-platform speech recognition.
PaddleOCR
Multilingual OCR toolkit with document structure extraction.
open_clip
PyTorch library for contrastive language-image pretraining.
Transformers
Unified API for pre-trained transformer models across frameworks.
llama_index
Connect LLMs to external data via RAG workflows.