You Only Look Once: Unified, Real-Time Object Detection
Problem
Framing
Two-stage detectors were accurate but slow, and their separately trained stages were hard to optimize jointly. YOLO replaces region proposals and per-region classification with a single network that regresses boxes and class probabilities from the full image, reaching 45 FPS at 63.4 mAP on VOC 2007.
Currently Used Methods
Foundational
- @renFasterRCNN2015 — proposal-based CNN detector with end-to-end region refinement.
- Limitation in context: proposal generation still dominates latency.
- "Rich feature hierarchies for accurate object detection and semantic segmentation" — multi-stage region-based CNN detection.
- Limitation in context: separate proposal, feature, classifier, and regressor stages fragment optimization.
- "Fast R-CNN" — shared convolutions over region proposals improve detector efficiency.
- Limitation in context: still inherits external proposal bottlenecks.
- "Histograms of Oriented Gradients for Human Detection" with DPM variants — sliding-window part-based detection.
- Limitation in context: weaker accuracy and poor global scene reasoning.
Proposed Method
Architecture
YOLO divides the image into an $S \times S$ grid. Each cell predicts $B$ boxes, each with four coordinates and a confidence, plus $C$ conditional class probabilities, producing an $S \times S \times (5B + C)$ output tensor. For VOC, $S = 7$, $B = 2$, $C = 20$, so the output is $7 \times 7 \times 30$. The main model uses 24 convolutional layers plus 2 fully connected layers; Fast YOLO uses 9 convolutional layers.
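The tensor size follows directly from the per-cell layout. The sketch below is illustrative only: the helper names `output_shape` and `split_cell` are hypothetical, and the ordering of values inside a cell (boxes first, then class probabilities) is an assumption made for the example, not something the notes specify.

```python
# Illustrative sketch of the YOLO output layout for PASCAL VOC.
# S, B, C follow the paper's notation; helper names are hypothetical.
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def output_shape(s, b, c):
    """Each cell holds b boxes of (x, y, w, h, confidence) plus c class probabilities."""
    return (s, s, b * 5 + c)


def split_cell(cell, b, c):
    """Split one cell's flat prediction vector into boxes and class probabilities.

    Assumes boxes come first and class probabilities last (illustrative ordering).
    """
    if len(cell) != b * 5 + c:
        raise ValueError("cell vector has the wrong length")
    boxes = [cell[i * 5:(i + 1) * 5] for i in range(b)]
    class_probs = cell[b * 5:]
    return boxes, class_probs


print(output_shape(S, B, C))  # (7, 7, 30)
```

With the VOC settings each cell carries 30 numbers: 10 for the two boxes and 20 for the classes.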

Loss / Objective
The detector uses a multi-part sum-squared loss over box coordinates, objectness, and class probabilities.
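A minimal per-cell sketch of that loss is given below. It uses the paper's weights of 5 for coordinate error and 0.5 for confidence error on boxes without an object, and it compares square roots of width and height as the paper does. The function name `cell_loss` and its arguments are hypothetical, predicted widths and heights are assumed non-negative, and treating every non-responsible box as a "no object" box is one common reading of the indicator terms rather than a confirmed detail.

```python
import math

LAMBDA_COORD = 5.0  # weight on box-coordinate error (value from the paper)
LAMBDA_NOOBJ = 0.5  # weight on confidence error for boxes with no object


def cell_loss(pred_boxes, pred_classes, target_box=None,
              target_classes=None, responsible=0, target_conf=1.0):
    """Sum-squared loss for one grid cell.

    pred_boxes: list of (x, y, w, h, conf), one tuple per predicted box.
    target_box: ground-truth (x, y, w, h), or None if the cell is empty.
    responsible: index of the box assigned to the object (hypothetical
        argument; the paper picks the box with the highest IOU).
    target_conf: confidence target for the responsible box (the paper
        uses the IOU between predicted and ground-truth box).
    """
    loss = 0.0
    for j, (x, y, w, h, conf) in enumerate(pred_boxes):
        if target_box is not None and j == responsible:
            tx, ty, tw, th = target_box
            # Centre error, then size error on square roots of w and h.
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            loss += (conf - target_conf) ** 2
        else:
            # Boxes without an object are pushed toward zero confidence.
            loss += LAMBDA_NOOBJ * conf ** 2
    if target_box is not None:
        # Class error is only counted for cells that contain an object.
        loss += sum((p - t) ** 2 for p, t in zip(pred_classes, target_classes))
    return loss
```

The full loss would sum this quantity over all grid cells. Down-weighting empty boxes keeps the many object-free cells from overwhelming the gradient from the few cells that contain objects.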
Sampling Rule / Algorithm
At test time, YOLO multiplies conditional class probabilities by box confidence to form class-specific scores.
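For a single grid cell, this scoring step can be sketched as follows. The function names and the threshold default are hypothetical choices for illustration; the notes do not state a threshold.

```python
def class_scores(box_confidences, class_probs):
    """Return scores[j][c] = box_confidences[j] * class_probs[c] for one cell."""
    return [[conf * p for p in class_probs] for conf in box_confidences]


def best_detections(box_confidences, class_probs, threshold=0.2):
    """Keep (box_index, class_index, score) triples whose score clears the threshold.

    The default threshold is an illustrative value, not taken from the source.
    """
    scores = class_scores(box_confidences, class_probs)
    return [(j, c, s)
            for j, row in enumerate(scores)
            for c, s in enumerate(row)
            if s >= threshold]


# Two boxes and two classes, kept small for readability.
print(class_scores([0.5, 1.0], [0.5, 0.25]))  # [[0.25, 0.125], [0.5, 0.25]]
```

Because the class probabilities are shared by every box in a cell, a cell's boxes differ in score only through their confidences.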
Training Procedure
- Pretrain the first 20 convolutional layers on ImageNet.
- Pretraining input: $224 \times 224$.
- Detection input: $448 \times 448$.
- Main model: 24 convolutional layers, 2 fully connected layers.
- Fast YOLO: 9 convolutional layers.
- Final-layer activation: linear.
- Hidden-layer activation: leaky ReLU, $\phi(x) = x$ if $x > 0$, else $0.1x$.
- Loss weights: $\lambda_{\text{coord}} = 5$, $\lambda_{\text{noobj}} = 0.5$.
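The hidden-layer activation from the list above is small enough to write out in full. This is a scalar sketch using the paper's negative-side slope of 0.1; the function name and the `negative_slope` parameter are illustrative, and a real implementation would apply it elementwise to tensors.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky rectified linear activation: identity for x > 0, scaled otherwise."""
    return x if x > 0 else negative_slope * x


print(leaky_relu(3.0))  # 3.0
```

Unlike a plain ReLU, negative inputs keep a small non-zero gradient. The final layer, by contrast, uses a linear activation.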
Evaluation
Datasets
- PASCAL VOC 2007
- PASCAL VOC 2012
- Picasso Dataset
- People-Art Dataset
Metrics
- mAP
- Per-class AP
- FPS
- Error breakdown by background and localization failures
Headline results
- VOC 2007 test: YOLO 63.4 mAP, 45 FPS.
- VOC 2007 test: Fast YOLO 52.7 mAP, 155 FPS.
- VOC 2012 test: YOLO 57.9 mAP.
- VOC 2012 test: Fast R-CNN + YOLO 70.7 mAP.
- VOC 2007 test: Fast R-CNN + YOLO reaches 75.0 mAP, up from 71.8 for Fast R-CNN alone.
Table 1: VOC 2012 test per-class AP comparison
| Method | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR.RCNN_MORE_DATA [11] | 73.9 | 85.5 | 82.9 | 76.6 | 57.8 | 62.7 | 79.4 | 77.2 | 86.6 | 55.0 | 79.1 | 62.2 | 87.0 | 83.4 | 84.7 | 78.9 | 45.3 | 73.4 | 65.8 | 80.3 | 74.0 |
| HyperNet_VGG | 71.4 | 84.2 | 78.5 | 73.6 | 55.6 | 53.7 | 78.7 | 79.8 | 87.7 | 49.6 | 74.9 | 52.1 | 86.0 | 81.7 | 83.3 | 81.8 | 48.6 | 73.5 | 59.4 | 79.9 | 65.7 |
| HyperNet_SP | 71.3 | 84.1 | 78.3 | 73.3 | 55.5 | 53.6 | 78.6 | 79.6 | 87.5 | 49.5 | 74.9 | 52.1 | 85.6 | 81.6 | 83.2 | 81.6 | 48.4 | 73.2 | 59.3 | 79.7 | 65.6 |
| Fast R-CNN + YOLO | 70.7 | 83.4 | 78.5 | 73.5 | 55.8 | 43.4 | 79.1 | 73.1 | 89.4 | 49.4 | 75.5 | 57.0 | 87.5 | 80.9 | 81.0 | 74.7 | 41.8 | 71.5 | 68.5 | 82.1 | 67.2 |
| MR.RCNN_S_CNN [11] | 70.7 | 85.0 | 79.6 | 71.5 | 55.3 | 57.7 | 76.0 | 73.9 | 84.6 | 50.5 | 74.3 | 61.7 | 85.5 | 79.9 | 81.7 | 76.4 | 41.0 | 69.0 | 61.2 | 77.7 | 72.1 |
| Faster R-CNN [28] | 70.4 | 84.9 | 79.8 | 74.3 | 53.9 | 49.8 | 77.5 | 75.9 | 88.5 | 45.6 | 77.1 | 55.3 | 86.9 | 81.7 | 80.9 | 79.6 | 40.1 | 72.6 | 60.9 | 81.2 | 61.5 |
| DEEP_ENS_COCO | 70.1 | 84.0 | 79.4 | 71.6 | 51.9 | 51.1 | 74.1 | 72.1 | 88.6 | 48.3 | 73.4 | 57.8 | 86.1 | 80.0 | 80.7 | 70.4 | 46.6 | 69.6 | 68.8 | 75.9 | 71.4 |
| NoC [29] | 68.8 | 82.8 | 79.0 | 71.6 | 52.3 | 53.7 | 74.1 | 69.0 | 84.9 | 46.9 | 74.3 | 53.1 | 85.0 | 81.3 | 79.5 | 72.2 | 38.9 | 72.4 | 59.5 | 76.7 | 68.1 |
| Fast R-CNN [14] | 68.4 | 82.3 | 78.4 | 70.8 | 52.3 | 38.7 | 77.8 | 71.6 | 89.3 | 44.2 | 73.0 | 55.0 | 87.5 | 80.5 | 80.8 | 72.0 | 35.1 | 68.3 | 65.7 | 80.4 | 64.2 |
| UMICH_FGS_STRUCT | 66.4 | 82.9 | 76.1 | 64.1 | 44.6 | 49.4 | 70.3 | 71.2 | 84.6 | 42.7 | 68.6 | 55.8 | 82.7 | 77.1 | 79.9 | 68.7 | 41.4 | 69.0 | 60.0 | 72.0 | 66.2 |
| NUS_NIN_C2000 [7] | 63.8 | 80.2 | 73.8 | 61.9 | 43.7 | 43.0 | 70.3 | 67.6 | 80.7 | 41.9 | 69.7 | 51.7 | 78.2 | 75.2 | 76.9 | 65.1 | 38.6 | 68.3 | 58.0 | 68.7 | 63.3 |
| BabyLearning [7] | 63.2 | 78.0 | 74.2 | 61.3 | 45.7 | 42.7 | 68.2 | 66.8 | 80.2 | 40.6 | 70.0 | 49.8 | 79.0 | 74.5 | 77.9 | 64.0 | 35.3 | 67.9 | 55.7 | 68.7 | 62.6 |
| NUS_NIN | 62.4 | 77.9 | 73.1 | 62.6 | 39.5 | 43.3 | 69.1 | 66.4 | 78.9 | 39.1 | 68.1 | 50.0 | 77.2 | 71.3 | 76.1 | 64.7 | 38.4 | 66.9 | 56.2 | 66.9 | 62.7 |
| R-CNN VGG BB [13] | 62.4 | 79.6 | 72.7 | 61.9 | 41.2 | 41.9 | 65.9 | 66.4 | 84.6 | 38.5 | 67.2 | 46.7 | 82.0 | 74.8 | 76.0 | 65.2 | 35.6 | 65.4 | 54.2 | 67.4 | 60.3 |
| R-CNN VGG [13] | 59.2 | 76.8 | 70.9 | 56.6 | 37.5 | 36.9 | 62.9 | 63.6 | 81.1 | 35.7 | 64.3 | 43.9 | 80.4 | 71.6 | 74.0 | 60.0 | 30.8 | 63.4 | 52.0 | 63.5 | 58.7 |
| YOLO | 57.9 | 77.0 | 67.2 | 57.7 | 38.3 | 22.7 | 68.3 | 55.9 | 81.4 | 36.2 | 60.8 | 48.5 | 77.2 | 72.3 | 71.3 | 63.5 | 28.9 | 52.2 | 54.8 | 73.9 | 50.8 |
| Feature Edit [33] | 56.3 | 74.6 | 69.1 | 54.4 | 39.1 | 33.1 | 65.2 | 62.7 | 69.7 | 30.8 | 56.0 | 44.6 | 70.0 | 64.4 | 71.1 | 60.2 | 33.3 | 61.3 | 46.4 | 61.7 | 57.8 |
| R-CNN BB [13] | 53.3 | 71.8 | 65.8 | 52.0 | 34.1 | 32.6 | 59.6 | 60.0 | 69.8 | 27.6 | 52.0 | 41.7 | 69.6 | 61.3 | 68.3 | 57.8 | 29.6 | 57.8 | 40.9 | 59.3 | 54.1 |
| SDS [16] | 50.7 | 69.7 | 58.4 | 48.5 | 28.3 | 28.8 | 61.3 | 57.5 | 70.8 | 24.1 | 50.7 | 35.9 | 64.9 | 59.1 | 65.8 | 57.1 | 26.0 | 58.8 | 38.6 | 58.9 | 50.7 |
| R-CNN [13] | 49.6 | 68.1 | 63.8 | 46.1 | 29.4 | 27.9 | 56.6 | 57.0 | 65.9 | 26.5 | 48.7 | 39.5 | 66.2 | 57.3 | 65.4 | 53.2 | 26.2 | 54.5 | 38.1 | 50.6 | 51.6 |
Ablations
- Error type: YOLO makes fewer than half as many background errors as Fast R-CNN.
- Error type: YOLO makes more localization errors, especially on small objects.
- Model size: Fast YOLO loses 10.7 mAP relative to YOLO (63.4 to 52.7) but reaches 155 FPS.
- System combination: adding YOLO to Fast R-CNN lifts VOC 2012 mAP by 2.3 points.
Method Strengths and Weaknesses
Strengths
- Single network removes proposals and region-wise rescoring.
- Real-time speed: 45 FPS at 63.4 mAP on VOC 2007.
- YOLO makes fewer than half as many background errors as Fast R-CNN.
- Transfers better to artwork on Picasso and People-Art.
Weaknesses
- Localization error remains the main failure mode.
- Small objects are hard to localize precisely.
- Grid-cell responsibility limits crowded-object handling.
- Sum-squared loss misaligns with mAP.
Suggestions from the authors
- Improve localization without losing real-time speed.
- Detect small objects more accurately.
- Relax coarse grid constraints on nearby objects.
- Extend unified detection to new domains and inputs.
Links
Prior Papers
- @renFasterRCNN2015 — Proposal-based detector that YOLO replaces with single-shot regression.
Further Papers
- @heMaskRCNN2017 — Later region-based detector on the same speed-accuracy frontier YOLO reshapes.