CLIP: Contrastive language-image pretraining model
Multimodal zero-shot classifier using contrastive vision-language learning.
Learn more about CLIP
CLIP is a neural network system that learns visual concepts from natural language supervision by training an image encoder and text encoder to maximize cosine similarity between correct image-text pairs while minimizing similarity for incorrect pairs. The architecture consists of two separate encoders—a vision transformer or ResNet for images and a transformer for text—that project their inputs into a shared embedding space where semantically related images and text descriptions are positioned closely together. The model enables zero-shot image classification by computing similarity scores between an image's embedding and text embeddings of candidate class descriptions, selecting the class with the highest similarity. CLIP's contrastive pretraining approach allows it to perform image retrieval, classification, and other vision-language tasks without task-specific fine-tuning, making it adaptable to diverse downstream applications. The system demonstrates strong generalization capabilities across different visual domains due to training on 400 million image-text pairs collected from the internet, though performance varies based on the specificity and distribution of evaluation tasks.
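As a concrete illustration of that scoring step, the sketch below uses the openai/clip Python package to embed one image and two candidate captions and compare them by cosine similarity; the image path and captions are placeholders.

import torch
import clip
from PIL import Image

# Load a pretrained CLIP model (ViT-B/32 chosen here for illustration).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: one image and two candidate descriptions.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    # Project both modalities into the shared embedding space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize so the dot product equals cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T  # shape (1, 2): one score per caption
print(similarity)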
Zero-shot transfer learning
CLIP can classify images into arbitrary categories specified as natural language text, without any task-specific training data or fine-tuning; on ImageNet, zero-shot CLIP matches the accuracy of the original supervised ResNet-50 without using any of its 1.28 million labeled training examples.
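Because the class set is just a list of strings, switching tasks means switching that list rather than collecting labels. Below is a sketch of a reusable zero-shot classifier, continuing from the snippet above; the prompt template and label names are illustrative.

# Reuses model, preprocess, device, and the imports from the snippet above.
labels = ["golden retriever", "tabby cat", "pickup truck", "espresso"]
prompts = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

with torch.no_grad():
    # Encode the label prompts once; they serve as the classifier "weights".
    class_weights = model.encode_text(prompts)
    class_weights = class_weights / class_weights.norm(dim=-1, keepdim=True)

def classify(pil_image):
    # Predict the label whose prompt is nearest in the shared embedding space.
    x = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = model.encode_image(x)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return labels[(feats @ class_weights.T).argmax().item()]

print(classify(Image.open("photo.jpg")))  # placeholder image path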
Contrastive training approach
The model uses contrastive learning to align image and text representations in a shared embedding space, enabling direct comparison between visual and linguistic features through cosine similarity.
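The released package does not include the training loop, but the objective described above can be sketched as a symmetric cross-entropy over a batch's image-text similarity matrix, where the matched pairs sit on the diagonal; the function and argument names below are illustrative.

import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, logit_scale):
    # image_features, text_features: (N, d) L2-normalized embeddings of N matched pairs.
    logits = logit_scale * image_features @ text_features.T  # (N, N) scaled cosine similarities
    targets = torch.arange(logits.shape[0], device=logits.device)  # correct pair = diagonal
    loss_images = F.cross_entropy(logits, targets)    # image -> text direction
    loss_texts = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_images + loss_texts) / 2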
Multiple model variants
CLIP provides several model configurations with different vision architectures (ResNet and Vision Transformer variants) and scales, allowing trade-offs between computational requirements and performance.
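The installed clip package lists its configurations through clip.available_models(); the exact names returned depend on the package version. A quick way to inspect and switch variants:

import clip

print(clip.available_models())  # e.g. ResNet variants ("RN50", ...) and ViT variants ("ViT-B/32", ...)

# Smaller variants load faster and need less memory; larger ones generally score higher.
model, preprocess = clip.load("RN50")        # ResNet-50 image encoder
# model, preprocess = clip.load("ViT-B/16")  # larger Vision Transformer image encoder

The quickstart below uses the ViT-B/32 variant for an end-to-end zero-shot prediction.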
import torch
import clip
from PIL import Image

# Load the ViT-B/32 model and its matching image preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One image and three candidate labels expressed as text.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a cat", "a dog", "a car"]).to(device)

with torch.no_grad():
    # logits_per_image holds the scaled image-text similarity scores.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(f"Predicted: {['cat', 'dog', 'car'][probs[0].argmax()]}")