Stable Diffusion: Latent text-to-image diffusion model
CLIP-conditioned latent diffusion model for text-to-image synthesis.
Stable Diffusion is a latent diffusion model for text-to-image synthesis. It operates by encoding images into a latent space using a downsampling-factor 8 autoencoder, then applying diffusion processes conditioned on text embeddings from a CLIP encoder. The architecture consists of an 860M parameter UNet and a 123M parameter text encoder, designed to run on GPUs with at least 10GB VRAM. The model was pretrained on 256x256 images and subsequently fine-tuned on 512x512 images, making it suitable for generating images from natural language descriptions.
Latent Space Diffusion
Operates in a compressed latent space with 8x downsampling per spatial dimension rather than in pixel space, so the diffusion model processes 64x fewer spatial positions. Maintains high image quality while enabling faster generation on consumer hardware compared to pixel-based diffusion models.
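The size reduction from 8x downsampling can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 4-channel latent is an assumption not stated on this page.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    latent_channels=4 is an assumption for illustration; the page only
    states the downsampling factor of 8.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the factor")
    return (latent_channels, height // factor, width // factor)


def spatial_reduction(factor=8):
    """Factor by which the number of spatial positions shrinks."""
    return factor * factor


shape = latent_shape(512, 512)
pixel_values = 3 * 512 * 512                   # RGB image
latent_values = shape[0] * shape[1] * shape[2]

print(shape)                                   # (4, 64, 64)
print(spatial_reduction())                     # 64
print(pixel_values // latent_values)           # 48
```

Under these assumptions a 512x512 image maps to a 64x64 latent grid: 8x smaller along each side, 64x fewer spatial positions, and 48x fewer values overall once channels are counted.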
Consumer GPU Compatible
The 860M-parameter UNet and 123M-parameter text encoder run on GPUs with at least 10GB of VRAM. Enables local deployment without cloud infrastructure or high-end datacenter hardware.
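As a rough sanity check on the VRAM figure, the memory taken by the weights alone can be estimated from the parameter counts above. This is a back-of-the-envelope sketch, not a measurement: it ignores the autoencoder, activations, and framework overhead, and the helper name is hypothetical.

```python
UNET_PARAMS = 860_000_000          # from the description above
TEXT_ENCODER_PARAMS = 123_000_000  # from the description above


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = UNET_PARAMS + TEXT_ENCODER_PARAMS
fp32_gb = weight_memory_gb(total_params, 4)  # 32-bit floats
fp16_gb = weight_memory_gb(total_params, 2)  # 16-bit floats

print(f"fp32 weights: {fp32_gb:.2f} GB")
print(f"fp16 weights: {fp16_gb:.2f} GB")
```

The estimate comes to roughly 3.9 GB in full precision and about 2.0 GB in half precision, which suggests that most of the stated 10GB budget goes to things other than these weights.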
CLIP Text Conditioning
Uses a frozen CLIP ViT-L/14 text encoder (123M parameters) to condition image generation on text. Leverages pretrained vision-language representations for flexible natural language control without custom text encoder training.
from diffusers import StableDiffusionPipeline
import torch

# Load the v1.4 weights in half precision and move the pipeline to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Generate one image from a text prompt and save it to disk.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")