MMDetection: PyTorch object detection toolbox
Modular PyTorch framework for object detection research and deployment.
MMDetection is a PyTorch-based detection framework developed as part of the OpenMMLab project. The codebase uses a modular architecture where detection pipelines are constructed by combining interchangeable components for backbones, necks, heads, and loss functions. It supports multiple detection paradigms including two-stage detectors (Faster R-CNN, Cascade R-CNN), single-stage detectors (RetinaNet, SSD, YOLO variants), and transformer-based approaches (DETR, Grounding DINO). The framework is commonly used for research, benchmarking detection algorithms, and deploying detection models in production applications.
Modular Component Architecture
Detection pipelines are built by combining independent modules for backbones, feature pyramids, detection heads, and loss functions. This lets users customize and experiment without modifying core framework code or maintaining a fork.
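As a rough illustration of this modularity, the sketch below builds a simplified config fragment in the dict style MMDetection configs use and swaps one component. It is deliberately incomplete (heads, losses, and training settings are omitted), and the specific field values are assumptions chosen for illustration rather than a copy of any shipped config.

```python
from copy import deepcopy

# Simplified fragment of a two-stage detector config.
# Assumption: values are illustrative; a real config also defines
# rpn_head, roi_head, train_cfg, test_cfg, and more.
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
    ),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5,
    ),
)

# Swap the backbone for a deeper ResNet without touching the neck.
# ResNet-50 and ResNet-101 expose the same stage widths, so the FPN
# input channels stay valid.
variant = deepcopy(model)
variant['backbone']['depth'] = 101

print(variant['backbone'])
```

Because each component is a separate dict entry, a change like this stays local to one key, which is what makes experimentation possible without editing framework code.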
Multi-Task Detection Framework
Handles object detection, instance segmentation, panoptic segmentation, and semi-supervised learning within a single framework, so there is no need to switch tools or codebases between vision tasks.
Comprehensive Model Zoo
Includes implementations of diverse detector architectures spanning two-stage methods, single-stage detectors, and transformer-based approaches. Pre-trained weights are provided, so models can be benchmarked or fine-tuned right away instead of being trained from scratch.
from mmdet.apis import init_detector, inference_detector
# Paths to the model config and its pre-trained checkpoint (adjust to your installation)
config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'
# Build the detector and load the weights onto the first GPU (use device='cpu' without a GPU)
model = init_detector(config_file, checkpoint_file, device='cuda:0')
# Run inference on a single image and print the raw result
result = inference_detector(model, 'demo/demo.jpg')
print(result)

Adds MM-Grounding-DINO, an open-source baseline for open-set detection (OVD, phrase grounding, REC) built on MMDetection with full training code and pre-trained models.
- Use MM-Grounding-DINO-Tiny for open-vocabulary detection tasks; it outperforms the original Grounding-DINO-Tiny baseline.
- Access training code, pre-training datasets, and fine-tuning configs at configs/mm_grounding_dino for reproducible grounding and detection pipelines.
Adds four SOTA Transformer models (DDQ, CO-DETR, AlignDETR, H-DINO) and exclusive Grounding DINO fine-tuning support; introduces FSDP/DeepSpeed training and RF100 benchmark for CNN vs. Transformer comparison.
- Enable AMP, gradient checkpointing, and FrozenBN in DINO to reduce memory usage; use FSDP or DeepSpeed to train large models with as little as 8.5 GB peak memory.
- Fine-tune Grounding DINO (described as the only library supporting this) for a gain of 0.9 mAP over the official zero-shot results; train Detic for open-vocabulary detection or multi-dataset joint training.
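The memory-saving options in the first bullet are typically switched on through config overrides. The fragment below is a minimal sketch, assuming a ResNet-style backbone that accepts `with_cp` and `norm_eval`, and MMEngine's `AmpOptimWrapper`; the optimizer values are placeholders, and FSDP or DeepSpeed setup is not shown because it depends on the strategy configuration of your MMEngine version.

```python
# Hypothetical config overrides for lower training memory.
# Assumption: the base config uses a ResNet-style backbone; optimizer
# hyperparameters are placeholders, not tuned values.

# AMP: wrap the optimizer so the forward pass runs in mixed precision.
optim_wrapper = dict(
    type='AmpOptimWrapper',
    loss_scale='dynamic',
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=1e-4),
)

model = dict(
    backbone=dict(
        # Gradient checkpointing: recompute activations in the backward
        # pass instead of storing them.
        with_cp=True,
        # FrozenBN: keep BatchNorm statistics and affine parameters fixed.
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
    ),
)
```

Gradient checkpointing trades extra compute for memory, so expect slower iterations in exchange for the smaller footprint.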
Adds tracking (MOT/VIS), multimodal inference (GLIP, XDecoder), and ViTDet; install multimodal deps via pip install -r requirements/multimodal.txt or mim install mmdet[multimodal].
- Install multimodal dependencies (requirements/multimodal.txt) to enable GLIP and XDecoder inference and evaluation.
- Use the new tracking algorithms (SORT, ByteTrack, OCSORT, etc.) and the Gradio demo to experiment with image tasks locally.