SGLang: Serving framework for large language models
High-performance inference engine for LLMs and VLMs.
SGLang is a serving framework written in Python and CUDA that handles inference for large language models and vision language models. The framework implements scheduling, batching, and memory management optimizations to improve throughput and latency during model serving. It supports multiple hardware backends including NVIDIA GPUs, AMD GPUs, and TPUs through different implementations like the SGLang-Jax backend. Common deployment scenarios include running open-source models like Llama, Qwen, and DeepSeek, as well as proprietary models through API compatibility layers.
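A typical deployment starts a local SGLang server that exposes an OpenAI-compatible endpoint. As a sketch, assuming SGLang is installed and the model weights are available, the launch looks like this (the model path and port are illustrative):

```shell
# Launch an SGLang server for Llama 3.1 8B on port 30000 (assumed defaults;
# check your installed version's CLI for the exact flags).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000
```

Once running, any OpenAI-compatible client can send requests to `http://localhost:30000/v1`.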
Multi-Backend Hardware Support
Runs natively on NVIDIA GPUs, AMD GPUs, and TPUs through specialized backends. SGLang-Jax enables TPU execution, while the CUDA and ROCm implementations target NVIDIA and AMD GPUs respectively, so models can move across hardware without code changes.
Cache-Aware Batch Scheduling
Zero-overhead scheduler and cache-aware load balancer optimize memory usage across concurrent requests. Reduces scheduling latency and maximizes throughput compared to naive batching approaches.
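To make the cache-aware idea concrete, here is a minimal sketch of routing requests to the worker that can reuse the longest cached prompt prefix. The `Worker` class and `route` function are illustrative assumptions, not SGLang's actual API; the production load balancer uses a radix tree rather than a linear scan.

```python
# Sketch of cache-aware routing: send each request to the worker whose
# resident KV cache shares the longest prefix with the incoming prompt.
# Worker/route are hypothetical names for illustration only.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Worker:
    def __init__(self, name: str):
        self.name = name
        self.cached_prompts: list[str] = []  # prompts with KV cache resident

    def best_match(self, prompt: str) -> int:
        # Longest cached prefix this worker can reuse for the prompt.
        return max((shared_prefix_len(p, prompt)
                    for p in self.cached_prompts), default=0)

def route(workers: list[Worker], prompt: str) -> Worker:
    # Prefer maximal cache reuse; a real scheduler would also weigh load.
    return max(workers, key=lambda w: w.best_match(prompt))

w1, w2 = Worker("w1"), Worker("w2")
w1.cached_prompts.append("You are a helpful assistant. Question: ")
w2.cached_prompts.append("Translate to French: ")
target = route([w1, w2], "You are a helpful assistant. Question: What is AI?")
print(target.name)  # -> w1
```

The design choice here is the same one the blurb describes: reusing cached prefixes avoids recomputing shared prompt prefills, which is where naive round-robin batching loses throughput.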
Day-One Model Support
Provides immediate integration with newly released language models through a flexible architecture that adapts to evolving model APIs. New model releases can be deployed within hours rather than weeks, ensuring users always have access to the latest AI capabilities without waiting for framework updates.
# SGLang exposes an OpenAI-compatible API, so the standard OpenAI client works.
from openai import OpenAI

# Point the client at a locally running SGLang server (default port 30000).
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",  # SGLang does not require a real API key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)

Major performance improvements with 1.5x faster diffusion models and linear scaling for million-token contexts
- Up to 1.5x faster across all major diffusion models
- Close to linear scaling with chunked pipeline parallelism for very long, million-token contexts
- Optimizing GLM4-MoE for Production: 65% Faster TTFT
- EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models
- Day 0 Support for GLM 4.7 Flash
Massive performance improvements with 10-12x faster cache-aware routing using radix trees
- Cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900)
- Prefix matching across 10,000 tree entries jumped from 41,000 to 124,000 operations per second
- Under concurrent load with 64 threads, the system processes 474,000 operations per second
- INSERT operations now process 440 MB/s (up from 38 MB/s)
- MATCH operations handle 253 MB/s (up from 83 MB/s)
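The INSERT and MATCH operations benchmarked above can be sketched with a simple trie over token IDs. This is an illustrative simplification, not SGLang's production radix tree, which additionally compresses chains of single-child nodes into single edges:

```python
# Toy trie over token IDs illustrating the two benchmarked operations:
# INSERT (add a token sequence) and MATCH (longest cached prefix length).
# A radix tree is the compressed form of this structure.

class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}

class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        # Walk down, creating nodes as needed (the INSERT operation).
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix_len(self, tokens: list[int]) -> int:
        # Walk down until the first mismatch (the MATCH operation).
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

tree = PrefixTree()
tree.insert([1, 2, 3, 4])
tree.insert([1, 2, 7])
print(tree.match_prefix_len([1, 2, 3, 9]))  # -> 3 (prefix [1, 2, 3] is cached)
```

Compressing single-child chains is what moves throughput from per-token to per-edge work, which is consistent with the large MB/s gains reported above.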
Day 0 support for multiple new models including Mimo-V2-Flash, Nemotron-Nano-v3, and LLaDA 2.0
- Day 0 Support for Mimo-V2-Flash
- Day 0 Support for Nemotron-Nano-v3
- Day 0 Support for LLaDA 2.0
- SGLang-Diffusion Day 0 Support for Qwen-Image-Edit-2509, Qwen-Image-Edit-2511, Qwen-Image-2512 and Qwen-Image-Layered
See how people are using SGLang
Related Repositories
Discover similar tools and frameworks used by developers
xFormers
Memory-efficient PyTorch components for transformer architectures.
Ultralytics YOLO
PyTorch library for YOLO-based real-time computer vision.
pix2pix
Torch implementation for paired image-to-image translation using cGANs.
YOLOv5
Real-time object detection with cross-platform deployment support.
Stable Diffusion
Text-to-image diffusion in compressed latent space.