You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

2016 · CVPR

Problem

Framing

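Two-stage detectors were accurate but slow, and their separate proposal and classification stages were hard to optimize jointly. YOLO replaces region proposals and per-region classification with a single network that regresses boxes and class probabilities from the full image, reaching 45 FPS at 63.4 mAP on VOC 2007.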

Currently Used Methods

Foundational

Proposed Method

Architecture

YOLO divides the image into an S × S grid. Each cell predicts B boxes, a confidence for each box, and C conditional class probabilities, producing an S × S × (5B + C) output tensor. For VOC, S = 7, B = 2, and C = 20, so the output is 7 × 7 × 30. The main model uses 24 convolutional layers plus 2 fully connected layers; Fast YOLO uses 9 convolutional layers.
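As a quick check of the tensor arithmetic above, the following minimal sketch computes the output shape and splits one cell's vector into boxes and class probabilities. The function names and the per-cell memory layout (B groups of five box values followed by C class probabilities) are assumptions made for illustration; the text fixes only the total size 5B + C.

```python
def yolo_output_shape(S, B, C):
    """Return the S x S x (5B + C) shape of the prediction tensor."""
    return (S, S, 5 * B + C)


def split_cell_vector(cell, B, C):
    """Split one cell's flat vector into B boxes and C class probabilities.

    Assumed (hypothetical) layout: B groups of (x, y, w, h, confidence)
    followed by C class probabilities.
    """
    if len(cell) != 5 * B + C:
        raise ValueError("expected a vector of length 5B + C")
    boxes = [tuple(cell[5 * j: 5 * j + 5]) for j in range(B)]
    class_probs = list(cell[5 * B:])
    return boxes, class_probs


print(yolo_output_shape(7, 2, 20))  # VOC settings from the text: (7, 7, 30)
```

Each box contributes five numbers (four coordinates and one confidence), which is where the factor 5B comes from.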

Figure: YOLO's unified detection pipeline, showing the S × S input grid, the predicted bounding boxes with confidences, the class-probability map, and the final detections.

Loss / Objective

The detector uses a multi-part sum-squared loss over box coordinates, objectness, and class probabilities.

\[
\begin{aligned}
\mathcal{L} = &\; \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{obj}}_{ij} \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{obj}}_{ij} \left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{obj}}_{ij} (C_i-\hat{C}_i)^2 \\
&+ \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{noobj}}_{ij} (C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbf{1}^{\mathrm{obj}}_{i} \sum_{c \in \mathrm{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2 .
\end{aligned}
\]
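To make the five terms of the loss concrete, here is a minimal pure-Python sketch that evaluates them over a list of grid cells. The dictionary layout, the function name `yolo_loss`, and the convention that every non-responsible box has a confidence target of 0 are assumptions made for illustration. The defaults λ_coord = 5 and λ_noobj = 0.5 are the weights reported in the YOLO paper; they are not stated in this summary. The responsible predictor index and its confidence target are taken as inputs rather than computed from IOU.

```python
import math


def yolo_loss(cells, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared, YOLO-style loss over a list of grid cells.

    Each cell is a dict with this (hypothetical) layout:
      "pred_boxes":   list of B tuples (x, y, w, h, conf), with w, h >= 0
      "pred_classes": list of C predicted class probabilities
      "truth":        None if the cell has no object, otherwise a dict with
                      "box" (x, y, w, h), "classes" (C targets),
                      "responsible" (index j of the responsible box) and
                      "conf" (confidence target for that box)
    """
    total = 0.0
    for cell in cells:
        truth = cell["truth"]
        for j, (x, y, w, h, conf) in enumerate(cell["pred_boxes"]):
            if truth is not None and j == truth["responsible"]:
                tx, ty, tw, th = truth["box"]
                # Terms 1 and 2: centre and square-root size errors.
                total += lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2)
                total += lambda_coord * (
                    (math.sqrt(w) - math.sqrt(tw)) ** 2
                    + (math.sqrt(h) - math.sqrt(th)) ** 2
                )
                # Term 3: confidence error for the responsible box.
                total += (conf - truth["conf"]) ** 2
            else:
                # Term 4: down-weighted confidence error, target 0 (assumed).
                total += lambda_noobj * conf ** 2
        if truth is not None:
            # Term 5: class-probability error, only for cells with an object.
            total += sum(
                (p - t) ** 2
                for p, t in zip(cell["pred_classes"], truth["classes"])
            )
    return total
```

Taking square roots of width and height means the same absolute size error costs more on a small box than on a large one, which is the effect the second term is designed to have.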

Sampling Rule / Algorithm

At test time, YOLO multiplies conditional class probabilities by box confidence to form class-specific scores.

\[
\Pr(\mathrm{Class}_i \mid \mathrm{Object}) \cdot \Pr(\mathrm{Object}) \cdot \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} = \Pr(\mathrm{Class}_i) \cdot \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} .
\]
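The test-time scoring rule above amounts to one multiplication per box and class. A minimal sketch for a single grid cell follows; the function name `class_specific_scores` and the list-based inputs are hypothetical, and any thresholding or non-maximum suppression applied afterwards is outside its scope.

```python
def class_specific_scores(class_probs, box_confidences):
    """Return one row of per-class scores for each box in a grid cell.

    class_probs:     C conditional probabilities Pr(Class_i | Object),
                     shared by all boxes in the cell
    box_confidences: B box confidences, each Pr(Object) * IOU
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# Two classes, two boxes: the second box is half as confident as the first.
scores = class_specific_scores([0.5, 0.25], [1.0, 0.5])
print(scores)  # [[0.5, 0.25], [0.25, 0.125]]
```

Because the class probabilities are predicted once per cell, every box in a cell shares the same class distribution and differs only in its confidence.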

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: VOC 2012 test per-class AP comparison

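| Method | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR.RCNN_MORE_DATA [11] | 73.9 | 85.5 | 82.9 | 76.6 | 57.8 | 62.7 | 79.4 | 77.2 | 86.6 | 55.0 | 79.1 | 62.2 | 87.0 | 83.4 | 84.7 | 78.9 | 45.3 | 73.4 | 65.8 | 80.3 | 74.0 |
| HyperNet_VGG | 71.4 | 84.2 | 78.5 | 73.6 | 55.6 | 53.7 | 78.7 | 79.8 | 87.7 | 49.6 | 74.9 | 52.1 | 86.0 | 81.7 | 83.3 | 81.8 | 48.6 | 73.5 | 59.4 | 79.9 | 65.7 |
| HyperNet_SP | 71.3 | 84.1 | 78.3 | 73.3 | 55.5 | 53.6 | 78.6 | 79.6 | 87.5 | 49.5 | 74.9 | 52.1 | 85.6 | 81.6 | 83.2 | 81.6 | 48.4 | 73.2 | 59.3 | 79.7 | 65.6 |
| Fast R-CNN + YOLO | 70.7 | 83.4 | 78.5 | 73.5 | 55.8 | 43.4 | 79.1 | 73.1 | 89.4 | 49.4 | 75.5 | 57.0 | 87.5 | 80.9 | 81.0 | 74.7 | 41.8 | 71.5 | 68.5 | 82.1 | 67.2 |
| MR.RCNN_S_CNN [11] | 70.7 | 85.0 | 79.6 | 71.5 | 55.3 | 57.7 | 76.0 | 73.9 | 84.6 | 50.5 | 74.3 | 61.7 | 85.5 | 79.9 | 81.7 | 76.4 | 41.0 | 69.0 | 61.2 | 77.7 | 72.1 |
| Faster R-CNN [28] | 70.4 | 84.9 | 79.8 | 74.3 | 53.9 | 49.8 | 77.5 | 75.9 | 88.5 | 45.6 | 77.1 | 55.3 | 86.9 | 81.7 | 80.9 | 79.6 | 40.1 | 72.6 | 60.9 | 81.2 | 61.5 |
| DEEP_ENS_COCO | 70.1 | 84.0 | 79.4 | 71.6 | 51.9 | 51.1 | 74.1 | 72.1 | 88.6 | 48.3 | 73.4 | 57.8 | 86.1 | 80.0 | 80.7 | 70.4 | 46.6 | 69.6 | 68.8 | 75.9 | 71.4 |
| NoC [29] | 68.8 | 82.8 | 79.0 | 71.6 | 52.3 | 53.7 | 74.1 | 69.0 | 84.9 | 46.9 | 74.3 | 53.1 | 85.0 | 81.3 | 79.5 | 72.2 | 38.9 | 72.4 | 59.5 | 76.7 | 68.1 |
| Fast R-CNN [14] | 68.4 | 82.3 | 78.4 | 70.8 | 52.3 | 38.7 | 77.8 | 71.6 | 89.3 | 44.2 | 73.0 | 55.0 | 87.5 | 80.5 | 80.8 | 72.0 | 35.1 | 68.3 | 65.7 | 80.4 | 64.2 |
| UMICH_FGS_STRUCT | 66.4 | 82.9 | 76.1 | 64.1 | 44.6 | 49.4 | 70.3 | 71.2 | 84.6 | 42.7 | 68.6 | 55.8 | 82.7 | 77.1 | 79.9 | 68.7 | 41.4 | 69.0 | 60.0 | 72.0 | 66.2 |
| NUS_NIN_C2000 [7] | 63.8 | 80.2 | 73.8 | 61.9 | 43.7 | 43.0 | 70.3 | 67.6 | 80.7 | 41.9 | 69.7 | 51.7 | 78.2 | 75.2 | 76.9 | 65.1 | 38.6 | 68.3 | 58.0 | 68.7 | 63.3 |
| BabyLearning [7] | 63.2 | 78.0 | 74.2 | 61.3 | 45.7 | 42.7 | 68.2 | 66.8 | 80.2 | 40.6 | 70.0 | 49.8 | 79.0 | 74.5 | 77.9 | 64.0 | 35.3 | 67.9 | 55.7 | 68.7 | 62.6 |
| NUS_NIN | 62.4 | 77.9 | 73.1 | 62.6 | 39.5 | 43.3 | 69.1 | 66.4 | 78.9 | 39.1 | 68.1 | 50.0 | 77.2 | 71.3 | 76.1 | 64.7 | 38.4 | 66.9 | 56.2 | 66.9 | 62.7 |
| R-CNN VGG BB [13] | 62.4 | 79.6 | 72.7 | 61.9 | 41.2 | 41.9 | 65.9 | 66.4 | 84.6 | 38.5 | 67.2 | 46.7 | 82.0 | 74.8 | 76.0 | 65.2 | 35.6 | 65.4 | 54.2 | 67.4 | 60.3 |
| R-CNN VGG [13] | 59.2 | 76.8 | 70.9 | 56.6 | 37.5 | 36.9 | 62.9 | 63.6 | 81.1 | 35.7 | 64.3 | 43.9 | 80.4 | 71.6 | 74.0 | 60.0 | 30.8 | 63.4 | 52.0 | 63.5 | 58.7 |
| YOLO | 57.9 | 77.0 | 67.2 | 57.7 | 38.3 | 22.7 | 68.3 | 55.9 | 81.4 | 36.2 | 60.8 | 48.5 | 77.2 | 72.3 | 71.3 | 63.5 | 28.9 | 52.2 | 54.8 | 73.9 | 50.8 |
| Feature Edit [33] | 56.3 | 74.6 | 69.1 | 54.4 | 39.1 | 33.1 | 65.2 | 62.7 | 69.7 | 30.8 | 56.0 | 44.6 | 70.0 | 64.4 | 71.1 | 60.2 | 33.3 | 61.3 | 46.4 | 61.7 | 57.8 |
| R-CNN BB [13] | 53.3 | 71.8 | 65.8 | 52.0 | 34.1 | 32.6 | 59.6 | 60.0 | 69.8 | 27.6 | 52.0 | 41.7 | 69.6 | 61.3 | 68.3 | 57.8 | 29.6 | 57.8 | 40.9 | 59.3 | 54.1 |
| SDS [16] | 50.7 | 69.7 | 58.4 | 48.5 | 28.3 | 28.8 | 61.3 | 57.5 | 70.8 | 24.1 | 50.7 | 35.9 | 64.9 | 59.1 | 65.8 | 57.1 | 26.0 | 58.8 | 38.6 | 58.9 | 50.7 |
| R-CNN [13] | 49.6 | 68.1 | 63.8 | 46.1 | 29.4 | 27.9 | 56.6 | 57.0 | 65.9 | 26.5 | 48.7 | 39.5 | 66.2 | 57.3 | 65.4 | 53.2 | 26.2 | 54.5 | 38.1 | 50.6 | 51.6 |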

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers