Mask2Former: Transformer-based universal image segmentation
Unified transformer architecture for multi-task image segmentation.
Mask2Former is a computer vision model that performs image segmentation using a transformer-based architecture with masked attention. A backbone encoder extracts image features, and a transformer decoder applies cross-attention restricted to the mask region predicted for each query to produce the segmentation output. One architecture handles three segmentation tasks (panoptic, instance, and semantic) in place of task-specific variants. The codebase supports training and inference on major segmentation benchmarks, including ADE20K, Cityscapes, COCO, and Mapillary Vistas, and also supports video instance segmentation.
Unified multi-task architecture
A single architecture handles panoptic, instance, and semantic segmentation without task-specific modifications, although a model is still trained separately for each task and dataset. Prior approaches typically required a separate architecture or significant architectural changes per task.
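As a rough illustration of why one output format can serve several tasks, the sketch below assumes the mask classification formulation that Mask2Former inherits from MaskFormer: each query yields class scores plus one mask, and a semantic map is obtained by weighting the masks with the class probabilities. The function name, argument names, and shapes are hypothetical and chosen for this example; this is not the repository's code.

```python
import numpy as np

def semantic_from_queries(class_logits, mask_logits):
    """Combine per-query predictions into a semantic label map.

    class_logits: (Q, C + 1) array; the last column is assumed to be "no object".
    mask_logits:  (Q, H, W) array of per-query mask logits.
    Returns an (H, W) integer array of class indices.
    """
    # Softmax over classes, computed stably.
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs = probs / probs.sum(axis=1, keepdims=True)
    probs = probs[:, :-1]  # drop the "no object" column

    # Sigmoid turns mask logits into soft per-pixel memberships.
    masks = 1.0 / (1.0 + np.exp(-mask_logits))

    # Sum over queries: per-class score at every pixel, shape (C, H, W).
    per_class = np.einsum("qc,qhw->chw", probs, masks)
    return per_class.argmax(axis=0)

# Two queries, two classes: query 0 covers the left half, query 1 the right half.
class_logits = np.array([[8.0, 0.0, 0.0],
                         [0.0, 8.0, 0.0]])
mask_logits = np.full((2, 2, 4), -10.0)
mask_logits[0, :, :2] = 10.0
mask_logits[1, :, 2:] = 10.0
labels = semantic_from_queries(class_logits, mask_logits)
```

Instance and panoptic outputs can be derived from the same per-query class and mask pairs by keeping the queries separate instead of summing over them, which is the sense in which the architecture is shared across tasks.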
Masked attention mechanism
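The transformer decoder restricts each query's cross-attention to the foreground of the mask predicted for that query, so the query gathers localized features instead of attending to the whole image. The masks are predicted by the model rather than fixed in advance. The Mask2Former paper credits this design with faster convergence and better segmentation quality than standard cross-attention.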
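A minimal NumPy sketch of masked cross-attention follows, assuming single-head attention without learned projections and a 0.5 threshold on the sigmoid of the mask logits. The names `masked_cross_attention`, `mask_logits`, and `threshold` are illustrative assumptions; the fallback to full attention for an empty mask is included so that every softmax row stays well defined.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Single-head cross-attention restricted to each query's predicted mask.

    queries:     (Q, D) query embeddings.
    keys/values: (N, D) flattened image features (N = H * W).
    mask_logits: (Q, N) mask logits, assumed to come from a previous decoder layer.
    Returns (output, weights) with shapes (Q, D) and (Q, N).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (Q, N) scaled dot products

    # A location is kept when its predicted mask probability reaches the threshold.
    keep = 1.0 / (1.0 + np.exp(-mask_logits)) >= threshold

    # A query whose mask is empty attends everywhere instead of nowhere.
    empty = ~keep.any(axis=1, keepdims=True)
    keep = keep | empty

    # Masked-out locations get -inf so they receive zero weight after softmax.
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3))
k = rng.normal(size=(4, 3))
v = rng.normal(size=(4, 3))
# Query 0 sees only the first two locations; query 1 has an empty mask.
m = np.array([[10.0, 10.0, -10.0, -10.0],
              [-10.0, -10.0, -10.0, -10.0]])
out, w = masked_cross_attention(q, k, v, m)
```

In this example the first query places zero weight on the last two locations, while the second query, whose mask is empty, falls back to ordinary attention over all four.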
Multi-dataset support
The framework includes implementations for several major segmentation datasets and benchmarks, and pre-trained models are available in the Model Zoo. Video instance segmentation is also supported and is described in an accompanying technical report.