Mask R-CNN

Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick

2017 · ICCV


Problem

Framing

Instance segmentation needs detector-grade recognition and pixel-accurate masks in one model, but RoIPool's coordinate quantization broke RoI-to-pixel alignment. Mask R-CNN closes this gap with a parallel mask head plus quantization-free RoIAlign, reaching 35.7 mask AP on COCO test-dev at 5 fps.

Currently Used Methods

Foundational

Proposed Method

Architecture

Mask R-CNN extends Faster R-CNN with a third per-RoI branch that predicts an $m \times m$ binary mask. The head is a small FCN on RoI-aligned features, instantiated with ResNet-C4 or ResNet-FPN detectors.

Architecture: the page shows the core Mask R-CNN pipeline, with RoIAlign feeding parallel class-box and convolutional mask branches.

Loss / Objective

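The objective adds a per-pixel mask loss to the classification and box losses. The mask term is an average binary cross-entropy computed only on positive RoIs, and for an RoI with ground-truth class $k$ only the $k$-th mask contributes.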

$$L = L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}}$$

$$L_{\mathrm{mask}} = - \frac{1}{m^2} \sum_{1 \le i,j \le m} \left[ y_{ij} \log \hat{y}_{ij}^{k} + (1-y_{ij}) \log \left(1-\hat{y}_{ij}^{k}\right) \right]$$
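The mask term can be checked numerically with a minimal sketch. It assumes the predicted probabilities for the ground-truth class have already been selected, so the inputs are two m x m grids. The function name `mask_loss` and the `eps` clamp are illustrative choices, not part of the paper's formula.

```python
import math


def mask_loss(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over one m x m mask.

    y_true: m x m nested list of 0/1 ground-truth labels y_ij.
    y_prob: m x m nested list of predicted probabilities for the
            ground-truth class k (the y-hat_ij^k terms).
    eps:    hypothetical clamp that keeps log() finite; an assumption
            added for numerical safety, not part of the formula.
    """
    m = len(y_true)
    total = 0.0
    for i in range(m):
        for j in range(m):
            # Clamp the probability away from exactly 0 or 1.
            p = min(max(y_prob[i][j], eps), 1.0 - eps)
            y = y_true[i][j]
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    # Leading minus sign and 1/m^2 normalisation, as in the equation.
    return -total / (m * m)


if __name__ == "__main__":
    target = [[1, 0], [0, 1]]
    uncertain = [[0.5, 0.5], [0.5, 0.5]]
    print(mask_loss(target, uncertain))
```

With every probability at 0.5 each pixel contributes log 2, so the average is also log 2. A prediction that matches the target exactly drives the loss toward zero.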

Algorithm

RoIAlign removes coordinate quantization by sampling exact floating-point locations with bilinear interpolation.

$$\mathbf{f}(x,y) = \sum_{i} \sum_{j} w_{ij}(x,y)\, \mathbf{F}_{ij}$$
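The interpolation sum can be made concrete with a small sketch for a single-channel feature map. It rests on several assumptions that the text above does not fix: integer coordinates are treated as grid points, the map is indexed as `feature[row][col]`, each bin averages a regular grid of sample points, and the names `bilinear_sample` and `roi_align` are illustrative. Only the four grid points surrounding $(x, y)$ receive non-zero weight $w_{ij}$.

```python
import math


def bilinear_sample(feature, x, y):
    """Evaluate sum_ij w_ij(x, y) * F_ij at a floating-point (x, y).

    feature is a nested list indexed feature[i][j], with i along y and
    j along x (an assumed layout).
    """
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbours always lie inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    j0, i0 = int(math.floor(x)), int(math.floor(y))
    j1, i1 = min(j0 + 1, w - 1), min(i0 + 1, h - 1)
    dx, dy = x - j0, y - i0
    return (feature[i0][j0] * (1 - dx) * (1 - dy)
            + feature[i0][j1] * dx * (1 - dy)
            + feature[i1][j0] * (1 - dx) * dy
            + feature[i1][j1] * dx * dy)


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool roi = (x1, y1, x2, y2) into an out_size x out_size grid.

    No coordinate is rounded: bin edges and sample points stay floats.
    Averaging samples * samples points per bin is an assumed choice.
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin.
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    acc += bilinear_sample(feature, px, py)
            row.append(acc / (samples * samples))
        out.append(row)
    return out


if __name__ == "__main__":
    ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]  # value equals x
    print(roi_align(ramp, (0.5, 0.5, 2.5, 2.5)))
```

On a feature map whose value equals its x coordinate, each pooled bin returns the x coordinate of the bin centre, which shows that fractional RoI boundaries are preserved instead of being snapped to the grid.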

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Sample grid: COCO test images with predicted instance masks from a ResNet-101-FPN Mask R-CNN model.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers