Mask2Former: Transformer-based universal image segmentation
Unified transformer architecture for multi-task image segmentation.
Learn more about Mask2Former
Mask2Former is a computer vision model that performs image segmentation using a transformer-based architecture with masked attention. Images pass through a backbone encoder, and a transformer decoder then applies cross-attention constrained by predicted masks to produce the segmentation outputs. A single architecture handles three segmentation tasks (panoptic, instance, and semantic) rather than requiring task-specific variants. The codebase supports training and inference on major segmentation benchmarks, including ADE20K, Cityscapes, COCO, and Mapillary Vistas, with additional support for video instance segmentation.
Unified multi-task architecture
A single architecture handles panoptic, instance, and semantic segmentation without task-specific modifications. Prior approaches typically required a separate model design or significant architectural changes for each task.
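One way to see why a single design can serve several tasks is that the model predicts a set of masks, each paired with class probabilities, and the task-specific output is derived from that set afterwards. The sketch below shows the semantic segmentation case under the mask-classification formulation: each pixel's score for a class is the sum, over all predicted masks, of class probability times mask probability. It is a dependency-free toy, not the repository's code; the function name `semantic_map` and the flat list layout are assumptions made for illustration.

```python
def semantic_map(class_probs, mask_probs):
    """Combine per-query predictions into one class label per pixel.

    class_probs: N rows of K class probabilities, one row per predicted mask
                 (assumes any "no object" column has already been dropped).
    mask_probs:  N rows of P per-pixel mask probabilities (flattened image).
    """
    num_classes = len(class_probs[0])
    num_pixels = len(mask_probs[0])
    labels = []
    for p in range(num_pixels):
        # Score of class c at pixel p: sum over masks of P(class) * P(pixel in mask).
        scores = [
            sum(cp[c] * mp[p] for cp, mp in zip(class_probs, mask_probs))
            for c in range(num_classes)
        ]
        # Pick the highest-scoring class for this pixel.
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels
```

Instance and panoptic outputs can be read from the same set of masks by keeping the masks separate instead of summing them per class, which is why no architectural change is needed between tasks.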
Masked attention mechanism
The transformer decoder restricts cross-attention for each query to the foreground region of the mask predicted by the previous decoder layer, instead of attending to the full image. Focusing attention on localized features in this way speeds up training convergence and improves segmentation quality.
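The idea can be sketched in a few lines: pixels outside a query's previously predicted mask receive an attention logit of negative infinity, so the softmax assigns them zero weight. The example below is a minimal single-head, dependency-free illustration, not the repository's implementation. The function name `masked_cross_attention`, the 0.5 threshold default, the dot-product scaling, and the fallback to full attention for an empty mask are assumptions stated here for clarity.

```python
import math


def masked_cross_attention(queries, keys, values, prev_masks, threshold=0.5):
    """Toy single-head cross-attention restricted to each query's predicted mask.

    queries:    N query vectors of length d
    keys:       P per-pixel key vectors of length d
    values:     P per-pixel value vectors
    prev_masks: N rows of P mask probabilities from the previous layer
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q, mask_probs in zip(queries, prev_masks):
        # Binarize the previous prediction: only foreground pixels may be attended.
        allowed = [p >= threshold for p in mask_probs]
        # Assumed fallback: an empty mask would leave nothing to attend to,
        # so attend to every pixel instead.
        if not any(allowed):
            allowed = [True] * len(keys)
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(logits)
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

With one query whose previous mask covers only the first pixel, the output equals that pixel's value vector, because every other pixel gets zero attention weight.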
Multi-dataset support
The framework includes implementations for several major segmentation datasets and benchmarks, with pre-trained models available in the Model Zoo. Video instance segmentation is also supported and is described in an accompanying technical report.
pip install git+https://github.com/facebookresearch/Mask2Former.git
Related Repositories
Discover similar tools and frameworks used by developers
PaddleOCR
Multilingual OCR toolkit with document structure extraction.
ollama
Go-based CLI for local LLM inference and management.
koboldcpp
Self-contained distribution of llama.cpp with KoboldAI-compatible API server for running large language models locally on consumer hardware.
pytorch
Python framework for differentiable tensor computation and deep learning.
opencv
Cross-platform C++ library for real-time computer vision algorithms.