
CLIP: Contrastive Language-Image Pre-training

Multimodal zero-shot classifier using contrastive vision-language learning.

LIVE RANKINGS (STEADY): #131 OVERALL · #58 IN AI & ML
STARS: 32.2K (+26 / 7 DAYS) · FORKS: 3.9K (+6 / 7 DAYS) · DOWNLOADS: 118

Learn more about CLIP

CLIP is a neural network system that learns visual concepts from natural language supervision by training an image encoder and text encoder to maximize cosine similarity between correct image-text pairs while minimizing similarity for incorrect pairs. The architecture consists of two separate encoders—a vision transformer or ResNet for images and a transformer for text—that project their inputs into a shared embedding space where semantically related images and text descriptions are positioned closely together. The model enables zero-shot image classification by computing similarity scores between an image's embedding and text embeddings of candidate class descriptions, selecting the class with the highest similarity. CLIP's contrastive pretraining approach allows it to perform image retrieval, classification, and other vision-language tasks without task-specific fine-tuning, making it adaptable to diverse downstream applications. The system demonstrates strong generalization capabilities across different visual domains due to training on 400 million image-text pairs collected from the internet, though performance varies based on the specificity and distribution of evaluation tasks.
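
As a rough sketch of how the shared embedding space is used for zero-shot classification, the snippet below encodes one image and a few candidate captions with the openai/CLIP package, normalizes both sets of features, and ranks the captions by cosine similarity. The image path photo.jpg and the caption list are placeholder assumptions.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode the image and the candidate captions into the shared embedding space.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]
tokens = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokens)

# Normalize so that a dot product equals cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Higher cosine similarity means the caption is a better match for the image.
similarity = (image_features @ text_features.T).squeeze(0)
for caption, score in zip(captions, similarity.tolist()):
    print(f"{score:.3f}  {caption}")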

Key features of CLIP

1. Zero-shot transfer learning: CLIP can classify images into arbitrary categories specified as natural-language text, without any task-specific training data or fine-tuning, and matches supervised baseline performance on standard benchmarks like ImageNet (see the quick-start example below).

2. Contrastive training approach: The model uses contrastive learning to align image and text representations in a shared embedding space, so visual and linguistic features can be compared directly via cosine similarity (a minimal sketch of this objective follows the list).

3. Multiple model variants: CLIP provides several model configurations with different vision backbones (ResNet and Vision Transformer variants) and scales, allowing trade-offs between computational cost and accuracy (the snippet after this list shows how to enumerate them).
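
The repository ships pretrained weights rather than training code, but the contrastive objective behind feature 2 can be sketched as a symmetric cross-entropy over the cosine-similarity matrix of a batch of image and text embeddings. The batch size, the 512-dimensional embeddings, and the temperature value below are illustrative assumptions, not values taken from the repo.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Normalize both modalities so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix for the batch, scaled by a temperature term.
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The matching text for image i sits at column i (and vice versa).
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2

# Toy batch: 8 random image/text embedding pairs of dimension 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt, logit_scale=100.0))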


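To see which pretrained configurations from feature 3 are available in a given install, the openai/CLIP package exposes clip.available_models():

import clip

# Lists the downloadable pretrained configurations, e.g. "RN50", "ViT-B/32", "ViT-L/14".
print(clip.available_models())

The repository's quick-start example below then loads the ViT-B/32 variant and performs zero-shot classification in a single forward pass:
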
import torch
import clip
from PIL import Image

# Pick a device and load the ViT-B/32 variant with its matching preprocessing transform.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Prepare one image and the candidate class descriptions.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a cat", "a dog", "a car"]).to(device)

# The forward pass returns image-to-text similarity logits; softmax turns them into class probabilities.
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(f"Predicted: {['cat', 'dog', 'car'][probs[0].argmax()]}")
