CLIP: Contrastive language-image pretraining model
Multimodal zero-shot classifier using contrastive vision-language learning.
Learn more about CLIP
CLIP is a neural network system that learns visual concepts from natural language supervision by training an image encoder and text encoder to maximize cosine similarity between correct image-text pairs while minimizing similarity for incorrect pairs. The architecture consists of two separate encoders—a vision transformer or ResNet for images and a transformer for text—that project their inputs into a shared embedding space where semantically related images and text descriptions are positioned closely together. The model enables zero-shot image classification by computing similarity scores between an image's embedding and text embeddings of candidate class descriptions, selecting the class with the highest similarity. CLIP's contrastive pretraining approach allows it to perform image retrieval, classification, and other vision-language tasks without task-specific fine-tuning, making it adaptable to diverse downstream applications. The system demonstrates strong generalization capabilities across different visual domains due to training on 400 million image-text pairs collected from the internet, though performance varies based on the specificity and distribution of evaluation tasks.
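As a concrete illustration of that scoring step, the sketch below uses the openai/clip Python package to embed one image and two candidate captions and compare them by cosine similarity; the image path and captions are placeholders.

import torch
import clip
from PIL import Image

# Load a pretrained CLIP model (ViT-B/32 chosen here for illustration).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: one image and two candidate descriptions.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    # Project both modalities into the shared embedding space.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize so the dot product equals cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T  # shape (1, 2): one score per caption
print(similarity)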
Zero-shot transfer learning
CLIP can classify images into arbitrary categories specified as natural language text, without any task-specific training data or fine-tuning; on ImageNet, zero-shot CLIP matches the accuracy of the original supervised ResNet-50 without using any of its 1.28 million labeled training examples.
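Because the class set is just a list of strings, switching tasks means switching that list rather than collecting labels. Below is a sketch of a reusable zero-shot classifier, continuing from the snippet above; the prompt template and label names are illustrative.

# Reuses model, preprocess, device, and the imports from the snippet above.
labels = ["golden retriever", "tabby cat", "pickup truck", "espresso"]
prompts = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

with torch.no_grad():
    # Encode the label prompts once; they serve as the classifier "weights".
    class_weights = model.encode_text(prompts)
    class_weights = class_weights / class_weights.norm(dim=-1, keepdim=True)

def classify(pil_image):
    # Predict the label whose prompt is nearest in the shared embedding space.
    x = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = model.encode_image(x)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return labels[(feats @ class_weights.T).argmax().item()]

print(classify(Image.open("photo.jpg")))  # placeholder image path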
Contrastive training approach
The model uses contrastive learning to align image and text representations in a shared embedding space, enabling direct comparison between visual and linguistic features through cosine similarity.
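The released package does not include the training loop, but the objective described above can be sketched as a symmetric cross-entropy over a batch's image-text similarity matrix, where the matched pairs sit on the diagonal; the function and argument names below are illustrative.

import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, logit_scale):
    # image_features, text_features: (N, d) L2-normalized embeddings of N matched pairs.
    logits = logit_scale * image_features @ text_features.T  # (N, N) scaled cosine similarities
    targets = torch.arange(logits.shape[0], device=logits.device)  # correct pair = diagonal
    loss_images = F.cross_entropy(logits, targets)    # image -> text direction
    loss_texts = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_images + loss_texts) / 2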
Multiple model variants
CLIP provides several model configurations with different vision architectures (ResNet and Vision Transformer variants) and scales, allowing trade-offs between computational requirements and performance.
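The installed clip package lists its configurations through clip.available_models(); the exact names returned depend on the package version. A quick way to inspect and switch variants:

import clip

print(clip.available_models())  # e.g. ResNet variants ("RN50", ...) and ViT variants ("ViT-B/32", ...)

# Smaller variants load faster and need less memory; larger ones generally score higher.
model, preprocess = clip.load("RN50")        # ResNet-50 image encoder
# model, preprocess = clip.load("ViT-B/16")  # larger Vision Transformer image encoder

The quickstart below uses the ViT-B/32 variant for an end-to-end zero-shot prediction.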
import torch
import clip
from PIL import Image

# Load the ViT-B/32 model and its matching image preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One image and three candidate labels expressed as text.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a cat", "a dog", "a car"]).to(device)

with torch.no_grad():
    # logits_per_image holds the scaled image-text similarity scores.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(f"Predicted: {['cat', 'dog', 'car'][probs[0].argmax()]}")