Grounding DINO: Open-set object detection with vision-language grounding
Zero-shot object detection from text prompts.
Grounding DINO is a vision-language transformer model for object detection that extends DINO with grounding capabilities through pre-training on image-text pairs. It uses a transformer architecture that jointly processes visual features and language embeddings to align object regions with textual descriptions. The model supports zero-shot detection by accepting arbitrary class names as text input, enabling detection of objects outside its training distribution. Common applications include open-world object detection, automated dataset annotation, and integration with segmentation models for instance-level tasks.
Vision-language alignment
Integrates DINO's detection backbone with grounded pre-training to directly map image regions to natural language descriptions, enabling detection based on arbitrary text queries rather than fixed class sets.
Zero-shot detection capability
Detects object classes not present in training data by leveraging language understanding, allowing the model to generalize to novel categories specified at inference time.
Transformer-based architecture
Uses a transformer encoder-decoder design that processes both visual and textual information jointly, enabling flexible reasoning about object-language relationships without separate classification heads per class.
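Because detection is driven by a caption rather than a fixed label set, the text prompt is just a single string of lowercase class names separated by " . " (the library's preprocessing also appends a trailing period). A small helper for building such captions might look like this; the function name is ours, not part of the library:

```python
def build_text_prompt(classes):
    """Join class names into a Grounding DINO-style caption.

    The model expects lowercase class names separated by " . ".
    """
    return " . ".join(c.strip().lower() for c in classes)

print(build_text_prompt(["Cat", "Dog", "person"]))  # -> "cat . dog . person"
```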
import torch
from PIL import Image
from groundingdino.util.inference import load_model, load_image, predict, annotate
# Load model configuration and weights
model_config_path = "groundingdino/config/GroundingDINO_SwinT_OGC.py"
model_checkpoint_path = "weights/groundingdino_swint_ogc.pth"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the model
model = load_model(model_config_path, model_checkpoint_path, device=device)
# Load and process image
image_path = "path/to/your/image.jpg"
image_source, image = load_image(image_path)
# Define text prompt: lowercase class names separated by " . "
text_prompt = "cat . dog . person"
box_threshold = 0.35   # minimum box confidence
text_threshold = 0.25  # minimum text-to-region match confidence
# Perform prediction
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=text_prompt,
    box_threshold=box_threshold,
    text_threshold=text_threshold,
    device=device,
)
# Annotate and save results (annotate returns a BGR NumPy array, not a PIL image)
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
Image.fromarray(annotated_frame[..., ::-1]).save("annotated_result.jpg")  # flip BGR -> RGB for PIL
print(f"Detected {len(boxes)} objects: {phrases}")
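`predict` returns boxes in normalized (cx, cy, w, h) form relative to the image size, so drawing or cropping usually requires converting to absolute (x0, y0, x1, y1) pixel coordinates. A minimal, library-free sketch of that conversion (the helper name is ours):

```python
def cxcywh_norm_to_xyxy(box, img_w, img_h):
    """Convert one normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )

# A box centered in a 640x480 image, covering half of each dimension:
print(cxcywh_norm_to_xyxy((0.5, 0.5, 0.5, 0.5), 640, 480))  # -> (160.0, 120.0, 480.0, 360.0)
```

The same math is what `annotate` applies internally before drawing; doing it by hand is useful when exporting detections to other tools.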
Related Repositories
Discover similar tools and frameworks used by developers
Docling
Fast document parser for RAG and AI workflows.
DINOv2
PyTorch vision transformers pretrained on 142M unlabeled images.
LangChain
Modular framework for chaining LLMs with external data.
FAISS
Efficient approximate nearest neighbor search for billion-scale vectors.
ONNX Runtime
Cross-platform engine for optimized ONNX model execution.