Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

2017 · ICCV

Problem

Framing

Instance segmentation needs detector-grade recognition and pixel-accurate masks in one model, but RoIPool's coordinate quantization breaks RoI-to-pixel alignment. Mask R-CNN closes this gap with a parallel mask head and quantization-free RoIAlign, reaching 35.7 mask AP on COCO test-dev at 5 fps.

Currently Used Methods

Foundational

@renFasterRCNN2015 — two-stage detection with shared RoI features.
- Limitation in context: no mask branch, and RoIPool misaligns pixel outputs.
Fully Convolutional Networks for Semantic Segmentation — dense semantic labeling with fully convolutional prediction.
- Limitation in context: no instance separation for overlapping objects.
Multi-task Network Cascades — cascaded detection, box, and mask stages.
- Limitation in context: more complex pipeline and lower COCO mask AP.
FCIS: Fully Convolutional Instance-aware Semantic Segmentation — position-sensitive score maps for instance masks.
- Limitation in context: overlap artifacts and weaker COCO results.
@heDeepResidualLearning2016 — residual backbones for strong visual features.
- Limitation in context: backbone strength alone does not fix alignment.

Proposed Method

Architecture

Mask R-CNN extends Faster R-CNN with a third per-RoI branch that runs in parallel with classification and box regression and predicts an $m \times m$ binary mask for each class. The head is a small FCN on RoI-aligned features, instantiated with ResNet-C4 or ResNet-FPN backbones.

Figure: the core Mask R-CNN pipeline, with RoIAlign feeding parallel class/box and convolutional mask branches.

Loss / Objective

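The objective adds a per-pixel mask loss, computed only on positive RoIs, to the classification and box-regression losses. $L_{\mathrm{mask}}$ is the average binary cross-entropy over the $m \times m$ mask, where $y_{ij}$ is the ground-truth label at pixel $(i, j)$ and $\hat{y}_{ij}^{k}$ is the sigmoid output of the mask for the ground-truth class $k$.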

$$L = L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}}$$

$$L_{\mathrm{mask}} = - \frac{1}{m^2} \sum_{1 \le i,j \le m} \left[ y_{ij} \log \hat{y}_{ij}^{k} + (1-y_{ij}) \log \left(1-\hat{y}_{ij}^{k}\right) \right]$$
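The mask loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `mask_loss`, the `eps` clipping constant, and the toy sizes are assumptions made for the example, and `k` is taken to be the ground-truth class of the RoI, as the superscript in the formula suggests.

```python
import numpy as np

def mask_loss(logits, target, k, eps=1e-7):
    """Average binary cross-entropy over the m x m mask of class k.

    logits: (K, m, m) raw per-class mask outputs for one positive RoI.
    target: (m, m) binary ground-truth mask for that RoI.
    k:      ground-truth class index; only this channel contributes.
    """
    y_hat = 1.0 / (1.0 + np.exp(-logits[k]))   # per-pixel sigmoid
    y_hat = np.clip(y_hat, eps, 1.0 - eps)     # assumed guard against log(0)
    bce = -(target * np.log(y_hat) + (1.0 - target) * np.log(1.0 - y_hat))
    return float(bce.mean())                   # mean equals (1 / m^2) * sum

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))            # toy sizes: K = 3 classes, m = 4
target = (rng.random((4, 4)) > 0.5).astype(np.float64)
loss = mask_loss(logits, target, k=1)
```

Because only channel `k` enters the loss, changing the logits of other classes leaves `loss` unchanged: the masks of different classes do not compete.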

Algorithm

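RoIAlign removes coordinate quantization: each sampling point keeps its exact floating-point location $(x, y)$, and its feature value is bilinearly interpolated from the neighboring grid features $\mathbf{F}_{ij}$ with weights $w_{ij}(x,y)$.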

$$\mathbf{f}(x,y) = \sum_{i} \sum_{j} w_{ij}(x,y)\, \mathbf{F}_{ij}$$
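A small NumPy sketch may help make the interpolation concrete. It is written for a single-channel feature map and one RoI. The helper names, the convention that grid values sit at integer coordinates, the 2x2 sampling grid per bin, and the use of averaging to aggregate samples are all assumptions of this sketch rather than details stated above.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a (H, W) feature map at a real-valued location (y, x)."""
    H, W = feat.shape
    y = min(max(y, 0.0), H - 1.0)              # clamp to the valid grid
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    # The four weights below are the w_ij(x, y) of the formula; they sum to 1.
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def roi_align(feat, box, out_size=2, samples=2):
    """Pool one RoI; box = (y1, x1, y2, x2) in feature-map units, never rounded."""
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for a in range(samples):           # regular grid of sample points
                for b in range(samples):
                    sy = y1 + (i + (a + 0.5) / samples) * bin_h
                    sx = x1 + (j + (b + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(feat, sy, sx))
            out[i, j] = np.mean(vals)          # assumed aggregation: average
    return out

# A linear ramp feat[y, x] = 2y + 3x, which bilinear interpolation reproduces exactly.
feat = np.add.outer(2.0 * np.arange(5), 3.0 * np.arange(5))
pooled = roi_align(feat, box=(0.5, 0.5, 2.5, 2.5))
```

On the linear ramp, each pooled value equals the ramp evaluated at the bin center, which is a quick sanity check that no coordinate was rounded along the way.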

Training Procedure

Backbone: ResNet-50/101-C4 or ResNet-50/101-FPN.
Mask resolution: $m = 28$ (FPN head); the C4 head outputs $14 \times 14$ masks.
Training data: COCO trainval35k; ablations on minival.
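Image scale: shorter side resized to 800 pixels; the enhanced appendix models add train-time scale augmentation, sampling the shorter side from $[640, 800]$.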
Inference: the mask branch runs on the 100 highest-scoring detection boxes per image.
Reported throughput: 5 fps for ResNet-101-FPN.
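The inference setting above can be sketched as a selection step. This is an illustrative sketch only: `select_masks`, the array layouts, and the 0.5 binarization threshold are assumptions for the example. For simplicity the mask probabilities are precomputed for every detection here, whereas a real pipeline would run the mask head only on the kept boxes.

```python
import numpy as np

def select_masks(scores, labels, mask_probs, top_k=100, threshold=0.5):
    """Keep the top_k highest-scoring detections and their class-specific masks.

    scores:     (N,) detection confidences.
    labels:     (N,) predicted class index per detection.
    mask_probs: (N, K, m, m) per-class mask probabilities.
    """
    order = np.argsort(-scores)[:top_k]          # best detections first
    picked = mask_probs[order, labels[order]]    # (n, m, m): channel of predicted class
    return order, picked >= threshold            # assumed binarization threshold

rng = np.random.default_rng(0)
N, K, m = 250, 5, 28                             # toy detection count and class count
scores = rng.random(N)
labels = rng.integers(0, K, size=N)
mask_probs = rng.random((N, K, m, m))
keep, masks = select_masks(scores, labels, mask_probs)
```

Selecting before predicting masks is what keeps the mask branch cheap: its cost scales with the number of kept boxes, not with the number of proposals.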

Evaluation

Datasets

COCO trainval35k, minival, test-dev.
COCO person keypoints.
Cityscapes test.

Metrics

COCO mask AP.
$\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$.
$\mathrm{AP}_S$, $\mathrm{AP}_M$, $\mathrm{AP}_L$.
Box AP.
Keypoint AP.
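The mask metrics listed above all rest on mask IoU: COCO mask AP averages precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05, while $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ fix the threshold at 0.50 and 0.75. The sketch below shows only the IoU step, not the full AP computation. The function name and the toy masks are assumptions for the example.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                       # both masks empty: define IoU as 0
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

gt = np.zeros((8, 8), dtype=np.uint8)
gt[2:6, 2:6] = 1                         # 4 x 4 ground-truth square
pred = np.zeros((8, 8), dtype=np.uint8)
pred[2:6, 3:7] = 1                       # same square shifted right by one pixel

iou = mask_iou(pred, gt)                 # 12 overlapping pixels / 20 in the union
counts_at_50 = iou >= 0.50               # would count as a match for AP50
counts_at_75 = iou >= 0.75               # would not count for AP75
```

A one-pixel shift on a small object is enough to pass the 0.50 threshold but fail the 0.75 one, which is why the stricter metric is more sensitive to alignment.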

Headline results

COCO test-dev, ResNet-101-C4: mask AP 33.1, $\mathrm{AP}_{50}$ 54.9, $\mathrm{AP}_{75}$ 34.8.
COCO test-dev, ResNet-101-FPN: mask AP 35.7, $\mathrm{AP}_{50}$ 58.0, $\mathrm{AP}_{75}$ 37.8.
COCO test-dev, ResNeXt-101-FPN: mask AP 37.1, $\mathrm{AP}_{50}$ 60.0, $\mathrm{AP}_{75}$ 39.4.
COCO test images, ResNet-101-FPN: 35.7 mask AP at 5 fps.
Cityscapes test: 32.0 AP with COCO pre-training.

Sample grid: COCO test images with predicted instance masks from a ResNet-101-FPN Mask R-CNN model.

Ablations

RoIAlign vs. RoIPool: RoIAlign improves mask AP by a relative 10% to 50%.
Localization metric: RoIAlign gains grow at stricter $\mathrm{AP}_{75}$ .
Feature stride: stride-32 C5 with RoIAlign beats stride-16 C4, 30.9 AP vs. 30.3 AP.
Mask parameterization: class-agnostic masks match class-specific masks closely.
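A short arithmetic sketch suggests why the stride matters in these ablations. It assumes RoIPool maps an image coordinate to the feature map by rounding `x / stride` to the nearest integer, while RoIAlign keeps the fractional value. The function names and the example coordinate are hypothetical.

```python
import math

def roipool_coord(x, stride):
    """Quantized feature-map index: nearest integer to x / stride."""
    return math.floor(x / stride + 0.5)

def roialign_coord(x, stride):
    """Continuous feature-map coordinate, with no rounding."""
    return x / stride

def misalignment_px(x, stride):
    """Image-space offset introduced by the rounding step, in pixels."""
    return abs(roipool_coord(x, stride) * stride - x)

x = 210.0                                        # hypothetical RoI edge in image pixels
errors = {s: misalignment_px(x, s) for s in (16, 32)}
```

Under this assumption the worst-case offset is half the stride, so quantization costs up to 8 pixels at stride 16 and up to 16 pixels at stride 32. Removing the rounding is what makes coarser stride-32 features usable for pixel-level masks.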

Method Strengths and Weaknesses

Strengths

Adds masks to Faster R-CNN with one parallel FCN head.
RoIAlign fixes a concrete failure mode and yields large AP gains.
Strong COCO accuracy: 35.7 mask AP with ResNet-101-FPN.
Transfers to keypoints within the same instance-level framework.

Weaknesses

Still depends on two-stage detection before mask prediction.
Accuracy is highly sensitive to alignment details.
Small-object performance lags medium and large objects.
Reported speed uses heavy backbones, not lightweight real-time settings.

Suggestions from the authors

Extend the framework to other instance-level recognition tasks.
Explore stronger architectural variants on the same simple template.
Use the method as a baseline for future segmentation systems.
Transfer the framework to human pose estimation.
