
SGLang: Serving framework for large language models

High-performance inference engine for LLMs and VLMs.

LIVE RANKINGS
Overall rank: #20 · AI & ML rank: #11
30-day ranking trend: steady
Stars: 22.2K (+91 over the last 7 days)
Forks: 4.0K (+41 over the last 7 days)

Learn more about sglang

SGLang is a serving framework written in Python and CUDA that handles inference for large language models and vision-language models. The framework implements scheduling, batching, and memory-management optimizations to improve throughput and latency during model serving. It supports multiple hardware backends, including NVIDIA GPUs, AMD GPUs, and TPUs, the latter through the SGLang-Jax backend. Common deployment scenarios include serving open-source models such as Llama, Qwen, and DeepSeek, typically exposed through an OpenAI-compatible API layer.
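
As a minimal sketch of in-process use (the model name is illustrative, and the Engine API shown follows SGLang's offline-engine docs, so verify it against your installed version; servers are more commonly launched with `python -m sglang.launch_server`):

import sglang as sgl

# Offline (in-process) engine; the same model can instead be exposed
# over HTTP with `python -m sglang.launch_server`.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(
    ["What is AI?"],
    {"temperature": 0.7, "max_new_tokens": 64},  # sampling params as a plain dict
)
print(outputs[0]["text"])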


1. Multi-Backend Hardware Support

Runs natively on NVIDIA GPUs, AMD GPUs, and TPUs through specialized backends: the CUDA and ROCm implementations target NVIDIA and AMD hardware respectively, while SGLang-Jax enables TPU execution, all without changes to user code.
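
A quick, generic way to see which GPU stack a deployment is running on (plain PyTorch, not an SGLang API; ROCm builds of PyTorch also report CUDA availability):

import torch

# Generic PyTorch check, not SGLang-specific: ROCm builds of PyTorch
# also report torch.cuda.is_available() == True.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("ROCm build" if torch.version.hip else "CUDA build")
else:
    print("No GPU detected; TPUs are served via the separate SGLang-Jax backend.")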

2. Cache-Aware Batch Scheduling

A zero-overhead batch scheduler and a cache-aware load balancer optimize KV-cache reuse and memory usage across concurrent requests, reducing scheduling latency and raising throughput compared with naive batching; a conceptual sketch of cache-aware routing follows below.
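
A toy sketch of the cache-aware idea (illustration only, not SGLang's implementation): route each request to the worker whose cache already holds the longest prefix of the prompt, breaking ties by current load.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str, cached: dict[str, list[str]], load: dict[str, int]) -> str:
    """Pick the worker with the best cache hit; break ties by lowest load."""
    def score(worker: str) -> tuple[int, int]:
        hit = max((shared_prefix_len(prompt, p) for p in cached[worker]), default=0)
        return (hit, -load[worker])
    return max(cached, key=score)

cached = {"w0": ["You are a helpful assistant."], "w1": []}
load = {"w0": 3, "w1": 1}
print(route("You are a helpful assistant. What is AI?", cached, load))  # -> w0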

3. Day-One Model Support

Provides immediate integration with newly released language models through a flexible architecture that adapts to evolving model APIs. New model releases can often be deployed within hours rather than weeks, so users get the latest models without waiting on framework updates.


# Query a running SGLang server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # default SGLang server endpoint
    api_key="EMPTY",  # local server; no real key required
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0.7,
)

print(response.choices[0].message.content)


v0.5.5

Adds day-0 support for the Kimi-K2-Thinking and Minimax-M2 models, adds video/image generation, and patches CVE-2025-10164, an unsafe pickle deserialization vulnerability.

  • Update to mitigate CVE-2025-10164 by blocking unsafe pickle deserialization in the multiprocessing serializer (a generic illustration of the mitigation pattern follows below).
  • Enable video inference and leverage Blackwell MoE kernel optimizations for a ~10% FP8 performance gain on B200.
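
As background on the pickle issue, a common mitigation pattern (a generic sketch, not SGLang's actual patch) is to restrict which classes an unpickler may resolve:

import io
import pickle

# Generic illustration of blocking unsafe pickle deserialization:
# only explicitly whitelisted classes may be resolved during load().
class RestrictedUnpickler(pickle.Unpickler):
    ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str")}

    def find_class(self, module, name):
        if (module, name) not in self.ALLOWED:
            raise pickle.UnpicklingError(f"blocked: {module}.{name}")
        return super().find_class(module, name)

data = pickle.dumps({"ok": [1, 2, 3]})
print(RestrictedUnpickler(io.BytesIO(data)).load())  # {'ok': [1, 2, 3]}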
v0.5.4

Adds model gateway v0.2, beta speculative decoding scheduler, and DeepSeek-V3.2 optimizations; no breaking changes specified in release notes.

  • Enable model gateway v0.2 for routing and the beta overlap scheduler for speculative decoding with piecewise CUDA-graph prefill (a toy sketch of speculative decoding follows below).
  • Add native ModelOpt quantization, prefix cache for Qwen3/GDN/Mamba, and support for the Nemotron, DeepSeek OCR, Qwen3-Omni, and Olmo 3 models.
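
For context, speculative decoding works roughly like this (a toy greedy-verification sketch, not SGLang's scheduler; both model callables are dummies):

def speculative_step(prefix, draft_model, target_model, k=4):
    """One step: draft k tokens cheaply, verify with the target, keep the
    accepted run plus the target's first correction."""
    ctx, proposed = list(prefix), []
    for _ in range(k):
        t = draft_model(ctx)        # cheap guess for the next token
        proposed.append(t)
        ctx.append(t)
    verified = target_model(prefix, k)  # target's next-k tokens in one pass
    accepted = []
    for guess, truth in zip(proposed, verified):
        accepted.append(truth)      # truth == guess while the draft is right
        if guess != truth:
            break                   # stop at the first mismatch
    return list(prefix) + accepted

# Dummy models: the target "predicts" consecutive integers; the draft
# agrees only when the context length is even.
target = lambda prefix, k: [len(prefix) + i for i in range(k)]
draft = lambda ctx: len(ctx) if len(ctx) % 2 == 0 else -1
print(speculative_step([0, 1], draft, target))  # -> [0, 1, 2, 3]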
v0.5.3

Adds day-0 DeepSeek-V3.2 sparse attention, deterministic inference across backends, FlashAttention 4 prefill, and expanded model support (Qwen3-Next MTP+DP, Qwen3-VL, Apertus, SOLAR).

  • Enable deterministic inference by configuring attention-backend flags; see the release blog post for reproducibility requirements (a client-side reproducibility check follows below).
  • Upgrade dependencies to include sentencepiece; Qwen3-Next now supports MTP+DP, Ascend NPU, and Blackwell hardware.
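
A client-side sanity check of what deterministic inference promises (a sketch only: it assumes a server launched with the determinism flags enabled, and whether the server honors the OpenAI client's standard `seed` field is an assumption):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def sample(prompt: str) -> str:
    # Greedy decoding plus a fixed seed; with deterministic inference
    # enabled, repeated calls should return byte-identical text.
    r = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=0,  # assumption: the server honors the OpenAI `seed` field
    )
    return r.choices[0].message.content

assert sample("What is AI?") == sample("What is AI?")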


