Mask R-CNN
Mask R-CNN
Problem
Framing
Instance segmentation needed detector-grade recognition and pixel-accurate masks in one model, but RoIPool broke RoI-to-pixel alignment. Mask R-CNN closes this with a parallel mask head plus quantization-free RoIAlign, reaching 35.7 mask AP on COCO test-dev at 5 fps.
Currently Used Methods
Foundational
- @renFasterRCNN2015 — two-stage detection with shared RoI features.
- Limitation in context: no mask branch, and RoIPool misaligns pixel outputs.
- Fully Convolutional Networks for Semantic Segmentation — dense semantic labeling with fully convolutional prediction.
- Limitation in context: no instance separation for overlapping objects.
- Multi-task Network Cascades — cascaded detection, box, and mask stages.
- Limitation in context: more complex pipeline and lower COCO mask AP.
- FCIS: Fully Convolutional Instance-aware Semantic Segmentation — position-sensitive score maps for instance masks.
- Limitation in context: overlap artifacts and weaker COCO results.
- @heDeepResidualLearning2016 — residual backbones for strong visual features.
- Limitation in context: backbone strength alone does not fix alignment.
Proposed Method
Architecture
Mask R-CNN extends Faster R-CNN with a third per-RoI branch that predicts an binary mask. The head is a small FCN on RoI-aligned features, instantiated with ResNet-C4 or ResNet-FPN detectors.

Loss / Objective
The objective adds a per-pixel mask loss on positive RoIs.
Algorithm
RoIAlign removes coordinate quantization by sampling exact floating-point locations with bilinear interpolation.
Training Procedure
- Backbone: ResNet-50/101-C4 or ResNet-50/101-FPN.
- Mask resolution: .
- Training data: COCO trainval35k; ablations on minival.
- Image scale: shorter side sampled from .
- Inference: top 100 detections per image.
- Reported throughput: 5 fps for ResNet-101-FPN.
Evaluation
Datasets
- COCO trainval35k, minival, test-dev.
- COCO person keypoints.
- Cityscapes test.
Metrics
- COCO mask AP.
- and .
- , , .
- Box AP.
- Keypoint AP.
Headline results
- COCO test-dev, ResNet-101-C4: mask AP 33.1, 54.9, 34.8.
- COCO test-dev, ResNet-101-FPN: mask AP 35.7, 58.0, 37.8.
- COCO test-dev, ResNeXt-101-FPN: mask AP 37.1, 60.0, 39.4.
- COCO test images, ResNet-101-FPN: 35.7 mask AP at 5 fps.
- Cityscapes test: 32.0 AP.

Ablations
- RoIAlign vs. RoIPool: mask AP improves by relative 10% to 50%.
- Localization metric: RoIAlign gains grow at stricter .
- Feature stride: stride-32 C5 with RoIAlign beats stride-16 C4, 30.9 AP vs. 30.3 AP.
- Mask parameterization: class-agnostic masks match class-specific masks closely.
Method Strengths and Weaknesses
Strengths
- Adds masks to Faster R-CNN with one parallel FCN head.
- RoIAlign fixes a concrete failure mode and yields large AP gains.
- Strong COCO accuracy: 35.7 mask AP with ResNet-101-FPN.
- Transfers to keypoints within the same instance-level framework.
Weaknesses
- Still depends on two-stage detection before mask prediction.
- Accuracy is highly sensitive to alignment details.
- Small-object performance lags medium and large objects.
- Reported speed uses heavy backbones, not lightweight real-time settings.
Suggestions from the authors
- Extend the framework to other instance-level recognition tasks.
- Explore stronger architectural variants on the same simple template.
- Use the method as a baseline for future segmentation systems.
- Transfer the framework to human pose estimation.
Links
Prior Papers
- @renFasterRCNN2015 — Mask R-CNN directly extends Faster R-CNN with a mask branch and RoIAlign.
- @heDeepResidualLearning2016 — ResNet provides the backbone family used in the main variants.
- @redmonYOLO2016 — a contemporaneous detection baseline that highlights the two-stage accuracy-speed tradeoff.
Further Papers
- @dosovitskiyViT2020 — transformer backbones become plausible successors for detection and segmentation pipelines.
- @liuSwinTransformer2021 — hierarchical transformer features directly target strong dense prediction backbones.