You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

2016 · CVPR

Problem

Framing

Two-stage detectors were accurate but slow, and their separately trained stages were hard to optimize jointly. YOLO replaces region proposals and per-region classification with a single network that regresses boxes and class probabilities from the full image, reaching 63.4 mAP at 45 FPS on VOC 2007.

Currently Used Methods

Foundational

@renFasterRCNN2015 — proposal-based CNN detector with end-to-end region refinement.
- Limitation in context: the propose-then-classify pipeline still falls short of real-time speed.
"Rich feature hierarchies for accurate object detection and semantic segmentation" — multi-stage region-based CNN detection.
- Limitation in context: separate proposal, feature, classifier, and regressor stages fragment optimization.
"Fast R-CNN" — shared convolutions over region proposals improve detector efficiency.
- Limitation in context: still inherits external proposal bottlenecks.
"Histograms of Oriented Gradients for Human Detection" with DPM variants — sliding-window part-based detection.
- Limitation in context: weaker accuracy and poor global scene reasoning.

Proposed Method

Architecture

YOLO divides the image into an $S \times S$ grid. Each cell predicts $B$ boxes, box confidences, and $C$ conditional class probabilities, producing an $S \times S \times (5B + C)$ output tensor. For VOC, $S=7$, $B=2$, $C=20$, so the output is $7 \times 7 \times 30$. The main model uses 24 convolutional layers plus 2 fully connected layers; Fast YOLO uses 9 convolutional layers.
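The tensor-size arithmetic above can be checked with a short sketch. The function names are illustrative and not taken from the paper; the only facts used are the $S \times S \times (5B + C)$ layout and the VOC settings.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor for an S x S grid.

    Each of the B boxes per cell carries 5 numbers (x, y, w, h, confidence),
    and each cell adds C conditional class probabilities.
    """
    return (S, S, 5 * B + C)


def max_boxes_per_image(S, B):
    """Upper bound on the number of boxes the grid can emit."""
    return S * S * B


voc_shape = yolo_output_shape(S=7, B=2, C=20)
voc_boxes = max_boxes_per_image(S=7, B=2)
print(voc_shape, voc_boxes)  # (7, 7, 30) 98
```

The box count is fixed by the grid, which is one reason the network is cheap to evaluate compared with scoring thousands of region proposals.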

Figure: YOLO's unified detection pipeline, showing a $5 \times 5$ input grid, predicted boxes with confidences, a class-probability map, and the final detections.

Loss / Objective

The detector uses a multi-part sum-squared loss over box coordinates, objectness, and class probabilities.

$$
\begin{aligned}
\mathcal{L} = &\; \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{obj}}_{ij} \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{obj}}_{ij} \left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{obj}}_{ij} (C_i-\hat{C}_i)^2 \\
&+ \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}^{\mathrm{noobj}}_{ij} (C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbf{1}^{\mathrm{obj}}_{i} \sum_{c \in \mathrm{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2 .
\end{aligned}
$$
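A minimal pure-Python sketch of the five loss terms may help in reading the formula. It is not the authors' implementation. The data layout, the `responsible` index, and the zero clamp inside the square root are assumptions made for illustration, and every predictor that is not responsible for an object is treated as a "noobj" predictor with target confidence 0.

```python
import math

LAMBDA_COORD = 5.0   # weight on box-coordinate terms
LAMBDA_NOOBJ = 0.5   # weight on confidence terms for boxes without objects


def _safe_sqrt(value):
    # Assumption: clamp at zero so a negative raw width/height does not raise.
    return math.sqrt(max(value, 0.0))


def yolo_loss(pred_boxes, pred_classes, targets):
    """Sum-squared detection loss over all grid cells.

    pred_boxes[i][j] = (x, y, w, h, conf) for cell i, predictor j
    pred_classes[i]  = list of C class probabilities for cell i
    targets[i]       = None for an empty cell, else a dict with keys
                       "box" (x, y, w, h), "conf", "classes", "responsible"
    (this layout is hypothetical, chosen only for the sketch)
    """
    loss = 0.0
    for i, cell_boxes in enumerate(pred_boxes):
        target = targets[i]
        for j, (x, y, w, h, conf) in enumerate(cell_boxes):
            if target is not None and j == target["responsible"]:
                tx, ty, tw, th = target["box"]
                # Terms 1 and 2: center and square-rooted size errors.
                loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
                loss += LAMBDA_COORD * (
                    (_safe_sqrt(w) - _safe_sqrt(tw)) ** 2
                    + (_safe_sqrt(h) - _safe_sqrt(th)) ** 2
                )
                # Term 3: confidence error for the responsible predictor.
                loss += (conf - target["conf"]) ** 2
            else:
                # Term 4: down-weighted confidence error, target is 0.
                loss += LAMBDA_NOOBJ * conf ** 2
        if target is not None:
            # Term 5: class-probability error, once per cell with an object.
            loss += sum(
                (p - t) ** 2 for p, t in zip(pred_classes[i], target["classes"])
            )
    return loss
```

Taking square roots of width and height makes a given absolute size error cost more for a small box than for a large one, which is the stated purpose of the second term.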

Sampling Rule / Algorithm

At test time, YOLO multiplies conditional class probabilities by box confidence to form class-specific scores.

$$
\Pr(\mathrm{Class}_i \mid \mathrm{Object}) \cdot \Pr(\mathrm{Object}) \cdot \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} = \Pr(\mathrm{Class}_i) \cdot \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}} .
$$
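For a single grid cell, the product above amounts to scaling the cell's shared class distribution by each box's confidence. The sketch below uses made-up numbers and a hypothetical function name; it only illustrates the multiplication, not any later filtering of detections.

```python
def class_specific_scores(box_confidences, class_probs):
    """Return scores[j][c] = class_probs[c] * box_confidences[j] for one cell.

    box_confidences: one confidence per predicted box in the cell
    class_probs:     conditional class probabilities shared by the cell
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


# Illustrative values: two boxes in one cell, three classes.
scores = class_specific_scores([0.8, 0.1], [0.5, 0.25, 0.25])
best_box = max(range(len(scores)), key=lambda j: max(scores[j]))
print(best_box)  # 0, the box with confidence 0.8 carries the top score
```

Because every box in a cell shares the same class probabilities, the boxes in one cell differ in score only through their confidence.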

Training Procedure

Pretrain the first 20 convolutional layers on ImageNet.
Pretraining input: $224 \times 224$.
Detection input: $448 \times 448$.
Main model: 24 convolutional layers, 2 fully connected layers.
Fast YOLO: 9 convolutional layers.
Final-layer activation: linear.
Hidden-layer activation: leaky ReLU, $\phi(x)=x$ if $x>0$, else $0.1x$.
Loss weights: $\lambda_{\mathrm{coord}}=5$, $\lambda_{\mathrm{noobj}}=0.5$.
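The hidden-layer activation listed above is simple enough to write out directly. This is a scalar sketch with a hypothetical function name, using the 0.1 negative slope from the list.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for positive inputs, scaled by 0.1 otherwise."""
    return x if x > 0 else negative_slope * x


activations = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
```

Unlike a plain ReLU, negative inputs keep a small nonzero gradient. The final layer uses a linear activation instead, so its raw outputs are not restricted in sign.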

Evaluation

Datasets

PASCAL VOC 2007
PASCAL VOC 2012
Picasso Dataset
People-Art Dataset

Metrics

mAP
Per-class AP
FPS
Error breakdown by background and localization failures

Headline results

VOC 2007 test: YOLO 63.4 mAP, 45 FPS.
VOC 2007 test: Fast YOLO 52.7 mAP, 155 FPS.
VOC 2012 test: YOLO 57.9 mAP.
VOC 2012 test: Fast R-CNN + YOLO 70.7 mAP.
VOC 2007 test: Fast R-CNN + YOLO 75.0 mAP, up from 71.8 for Fast R-CNN alone.

Table 1: VOC 2012 test per-class AP comparison

Method	mAP	aero	bike	bird	boat	bottle	bus	car	cat	chair	cow	table	dog	horse	mbike	person	plant	sheep	sofa	train	tv
MR_CNN_MORE_DATA [11]	73.9	85.5	82.9	76.6	57.8	62.7	79.4	77.2	86.6	55.0	79.1	62.2	87.0	83.4	84.7	78.9	45.3	73.4	65.8	80.3	74.0
HyperNet_VGG	71.4	84.2	78.5	73.6	55.6	53.7	78.7	79.8	87.7	49.6	74.9	52.1	86.0	81.7	83.3	81.8	48.6	73.5	59.4	79.9	65.7
HyperNet_SP	71.3	84.1	78.3	73.3	55.5	53.6	78.6	79.6	87.5	49.5	74.9	52.1	85.6	81.6	83.2	81.6	48.4	73.2	59.3	79.7	65.6
Fast R-CNN + YOLO	70.7	83.4	78.5	73.5	55.8	43.4	79.1	73.1	89.4	49.4	75.5	57.0	87.5	80.9	81.0	74.7	41.8	71.5	68.5	82.1	67.2
MR_CNN_S_CNN [11]	70.7	85.0	79.6	71.5	55.3	57.7	76.0	73.9	84.6	50.5	74.3	61.7	85.5	79.9	81.7	76.4	41.0	69.0	61.2	77.7	72.1
Faster R-CNN [28]	70.4	84.9	79.8	74.3	53.9	49.8	77.5	75.9	88.5	45.6	77.1	55.3	86.9	81.7	80.9	79.6	40.1	72.6	60.9	81.2	61.5
DEEP_ENS_COCO	70.1	84.0	79.4	71.6	51.9	51.1	74.1	72.1	88.6	48.3	73.4	57.8	86.1	80.0	80.7	70.4	46.6	69.6	68.8	75.9	71.4
NoC [29]	68.8	82.8	79.0	71.6	52.3	53.7	74.1	69.0	84.9	46.9	74.3	53.1	85.0	81.3	79.5	72.2	38.9	72.4	59.5	76.7	68.1
Fast R-CNN [14]	68.4	82.3	78.4	70.8	52.3	38.7	77.8	71.6	89.3	44.2	73.0	55.0	87.5	80.5	80.8	72.0	35.1	68.3	65.7	80.4	64.2
UMICH_FGS_STRUCT	66.4	82.9	76.1	64.1	44.6	49.4	70.3	71.2	84.6	42.7	68.6	55.8	82.7	77.1	79.9	68.7	41.4	69.0	60.0	72.0	66.2
NUS_NIN_C2000 [7]	63.8	80.2	73.8	61.9	43.7	43.0	70.3	67.6	80.7	41.9	69.7	51.7	78.2	75.2	76.9	65.1	38.6	68.3	58.0	68.7	63.3
BabyLearning [7]	63.2	78.0	74.2	61.3	45.7	42.7	68.2	66.8	80.2	40.6	70.0	49.8	79.0	74.5	77.9	64.0	35.3	67.9	55.7	68.7	62.6
NUS_NIN	62.4	77.9	73.1	62.6	39.5	43.3	69.1	66.4	78.9	39.1	68.1	50.0	77.2	71.3	76.1	64.7	38.4	66.9	56.2	66.9	62.7
R-CNN VGG BB [13]	62.4	79.6	72.7	61.9	41.2	41.9	65.9	66.4	84.6	38.5	67.2	46.7	82.0	74.8	76.0	65.2	35.6	65.4	54.2	67.4	60.3
R-CNN VGG [13]	59.2	76.8	70.9	56.6	37.5	36.9	62.9	63.6	81.1	35.7	64.3	43.9	80.4	71.6	74.0	60.0	30.8	63.4	52.0	63.5	58.7
YOLO	57.9	77.0	67.2	57.7	38.3	22.7	68.3	55.9	81.4	36.2	60.8	48.5	77.2	72.3	71.3	63.5	28.9	52.2	54.8	73.9	50.8
Feature Edit [33]	56.3	74.6	69.1	54.4	39.1	33.1	65.2	62.7	69.7	30.8	56.0	44.6	70.0	64.4	71.1	60.2	33.3	61.3	46.4	61.7	57.8
R-CNN BB [13]	53.3	71.8	65.8	52.0	34.1	32.6	59.6	60.0	69.8	27.6	52.0	41.7	69.6	61.3	68.3	57.8	29.6	57.8	40.9	59.3	54.1
SDS [16]	50.7	69.7	58.4	48.5	28.3	28.8	61.3	57.5	70.8	24.1	50.7	35.9	64.9	59.1	65.8	57.1	26.0	58.8	38.6	58.9	50.7
R-CNN [13]	49.6	68.1	63.8	46.1	29.4	27.9	56.6	57.0	65.9	26.5	48.7	39.5	66.2	57.3	65.4	53.2	26.2	54.5	38.1	50.6	51.6

Ablations

Error type: YOLO makes fewer than half as many background errors as Fast R-CNN.
Error type: YOLO makes more localization errors, especially on small objects.
Model size: Fast YOLO drops 10.7 mAP and reaches 155 FPS.
System combination: adding YOLO to Fast R-CNN lifts VOC 2012 mAP by 2.3 points.
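The deltas quoted above follow directly from the headline numbers and Table 1. A quick check, with variable names chosen only for this sketch:

```python
# VOC 2007 test: YOLO vs. Fast YOLO.
yolo_map, fast_yolo_map = 63.4, 52.7
yolo_fps, fast_yolo_fps = 45, 155

# VOC 2012 test: Fast R-CNN alone vs. Fast R-CNN + YOLO (Table 1).
fast_rcnn_map, combined_map = 68.4, 70.7

map_drop = round(yolo_map - fast_yolo_map, 1)       # accuracy cost of Fast YOLO
speedup = fast_yolo_fps / yolo_fps                  # relative frame-rate gain
combo_gain = round(combined_map - fast_rcnn_map, 1) # gain from adding YOLO

print(map_drop, round(speedup, 2), combo_gain)  # 10.7 3.44 2.3
```

Fast YOLO therefore gives up 10.7 mAP for roughly a 3.4x higher frame rate.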

Method Strengths and Weaknesses

Strengths

Single network removes proposals and region-wise rescoring.
Real-time speed: 45 FPS at 63.4 mAP on VOC 2007.
Makes fewer than half as many background errors as Fast R-CNN.
Transfers better to artwork on Picasso and People-Art.

Weaknesses

Localization error remains the main failure mode.
Small objects are hard to localize precisely.
Grid-cell responsibility limits crowded-object handling.
Sum-squared loss treats errors in small and large boxes alike, which does not align with mAP.
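The crowded-object weakness follows from how objects are assigned to cells. The sketch below assumes box centers normalized to $[0, 1]$ and a hypothetical helper name. It shows two nearby objects landing in the same cell of the $7 \times 7$ grid, where they must share that cell's $B$ boxes and its single set of class probabilities.

```python
def responsible_cell(cx, cy, S=7):
    """Map a normalized box center (cx, cy) to its (row, col) grid cell."""
    # Clamp so a center exactly on the right or bottom edge stays in range.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col


# Two illustrative objects whose centers are close together.
first = responsible_cell(0.50, 0.50)
second = responsible_cell(0.55, 0.52)
same_cell = first == second
print(first, second, same_cell)  # (3, 3) (3, 3) True
```

When `same_cell` is true and the two objects belong to different classes, the cell's shared class distribution cannot represent both at once.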

Suggestions from the authors

Improve localization without losing real-time speed.
Detect small objects more accurately.
Relax coarse grid constraints on nearby objects.
Extend unified detection to new domains and inputs.
