Wan2.1: Open-source video generation models
Diffusion transformer models for text-to-video and image-to-video generation.
Wan2.1 is a collection of diffusion-based video generation models developed for multiple video synthesis tasks. The architecture includes a custom video VAE component (Wan-VAE) for encoding and decoding video frames while preserving temporal information, paired with transformer-based diffusion models of varying scales. The smallest variant (T2V-1.3B) requires approximately 8GB of VRAM and can generate 480p video on consumer hardware, while larger variants support higher resolutions and more complex generation tasks. The models are integrated with standard frameworks like Hugging Face Diffusers and ComfyUI for inference.
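As a rough illustration of the Diffusers integration mentioned above, the sketch below runs the 1.3B text-to-video model through the WanPipeline class with the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint id; the class name, checkpoint id, and default settings (resolution, frame count, guidance scale) are taken from the public Diffusers release and should be verified against its current documentation.

```python
# Minimal text-to-video sketch via Hugging Face Diffusers (assumed WanPipeline
# integration and checkpoint id; verify against current Diffusers docs).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"  # 1.3B text-to-video checkpoint
pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walking through a snowy forest, cinematic lighting",
    height=480,            # 480p output, matching the 1.3B model's target resolution
    width=832,
    num_frames=81,         # roughly 5 seconds of video at 15 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "t2v_output.mp4", fps=15)
```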

Multi-task capability
Supports text-to-video, image-to-video, video editing, text-to-image, and video-to-audio generation within a single model family, rather than requiring separate specialized models for each task.
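For the image-to-video task, Diffusers exposes a separate pipeline class; the sketch below assumes the WanImageToVideoPipeline class and the Wan-AI/Wan2.1-I2V-14B-480P-Diffusers checkpoint id from that integration, with a placeholder input image path.

```python
# Hedged image-to-video sketch; pipeline class and checkpoint id are assumed
# from the Diffusers Wan2.1 integration and may change across releases.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"  # 480p image-to-video checkpoint
pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image("input_frame.png")  # placeholder path: conditioning first frame

frames = pipe(
    image=image,
    prompt="The camera slowly pans right as waves roll onto the beach",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "i2v_output.mp4", fps=15)
```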
Consumer GPU compatibility
The 1.3B-parameter variant runs within roughly 8 GB of VRAM, enabling deployment on standard consumer graphics cards without specialized hardware or quantization.
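When VRAM is tight, the standard Diffusers memory lever is model CPU offloading; the sketch below reuses the same assumed WanPipeline class and checkpoint id as above, and the actual footprint will vary with resolution, frame count, and driver.

```python
# Sketch of running the 1.3B model on a memory-constrained consumer GPU.
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
# Keep only the currently active sub-module (text encoder, transformer, or VAE)
# on the GPU and page the rest to system RAM between steps.
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="A paper boat drifting down a rain-filled gutter",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "t2v_8gb.mp4", fps=15)
```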
Multilingual text generation
Can render both Chinese and English text within generated video frames, addressing a gap in open-source video models at the time of release.
Related Repositories
Qwen
Alibaba Cloud's pretrained LLMs supporting Chinese/English with up to 32K context length.
Chart-GPT
AI tool that generates charts from natural language text descriptions.
Heretic
Tool that removes safety alignment from transformer language models using directional ablation without post-training.
Ray
Unified framework for scaling AI and Python applications from laptops to clusters with distributed runtime.
llama.cpp
Quantized LLM inference with hardware-accelerated CPU/GPU backends.