YOLO Object Detection — False Positives from Mismatch
YOLO false positives >0.
- YOLO reframes object detection as a single regression problem: one forward pass predicts bounding boxes and class probabilities simultaneously.
- Grid cells divide the image into an S×S grid; each cell predicts B bounding boxes and C class probabilities.
- Anchor boxes encode prior shapes to stabilize training — without them, box predictions drift.
- Non-Maximum Suppression (NMS) eliminates duplicate detections by IoU threshold, typically 0.5.
- Real-time performance: YOLOv8 runs at 100+ FPS on an NVIDIA T4 GPU, making it suitable for video.
- Production gotcha: NMS becomes the bottleneck at high frame rates — optimize with TensorRT or batch NMS.
Imagine you're a security guard watching a parking lot on a single TV screen. An old-school guard looks at every corner of the lot one piece at a time before calling anything suspicious — that takes ages. YOLO is the guard who glances at the whole screen once and instantly shouts 'there's a red car near gate 3, a person at gate 7, and a bike by the fence' — all in a single look. That's the entire secret: one forward pass through a neural network, and every object in the image is labelled and boxed simultaneously.
Every time your phone unlocks with your face, a Tesla decides not to brake for a shadow, or a warehouse robot grabs the right box off a conveyor belt, an object detector is running in the background. The demand for detectors that are both accurate and fast enough to run in real time has never been higher — and that tension between accuracy and speed is exactly where YOLO was born.
Before YOLO (You Only Look Once), the dominant paradigm was two-stage detection: a region-proposal network first suggests thousands of bounding-box candidates, then a separate classifier scores each one. Models like R-CNN and Faster R-CNN achieved excellent mean Average Precision (mAP), but their pipeline was fundamentally serial. On a 2015 GPU, Faster R-CNN ran at roughly 7 frames per second — nowhere near the 30+ fps required for real-time video. YOLO reframed detection as a single regression problem, collapsing both stages into one convolutional network pass and hitting 45 fps on the same hardware.
By the end of this article you'll understand exactly how YOLO divides an image into a grid, predicts bounding boxes and class probabilities simultaneously, why anchor boxes exist and what goes wrong without them, how Non-Maximum Suppression cleans up overlapping detections, and what the loss function is actually penalising. You'll also run a complete YOLOv8 inference and fine-tuning pipeline, and walk away knowing the production gotchas that trip up even experienced ML engineers.
What is Object Detection — YOLO?
Object detection is the computer vision task of locating and classifying objects within an image. YOLO (You Only Look Once) treats it as a single regression problem directly from image pixels to bounding box coordinates and class probabilities. Unlike sliding-window or region-proposal approaches, YOLO divides the input image into an S×S grid. Each grid cell is responsible for predicting B bounding boxes and C class probabilities — all in one forward pass of a convolutional neural network.
Why does this matter? Because inference cost no longer depends on the number of candidate regions. Two-stage detectors first generate thousands of region proposals and then classify each one, which is slow. YOLO eliminates the region proposal step entirely, achieving real-time inference on consumer hardware.
- Instead of asking 'where are objects?' then separately asking 'what are they?', YOLO asks both at once.
- Grid cells are like a coarse heatmap — each cell owns a small region and predicts the objects inside it.
- The output tensor has shape S × S × (B × 5 + C) — compact and fast to decode.
- It's regression, not classification: the model learns to output coordinates and probabilities directly.
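To make the output shape concrete, here is a minimal sketch that builds a dummy prediction tensor and splits one cell into its box rows and class vector. The values S=7, B=2, C=20 are the original paper's Pascal VOC setup; the helper names `yolo_output_shape` and `split_cell` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def yolo_output_shape(S: int, B: int, C: int) -> tuple:
    """Each of the S x S cells holds B boxes of 5 values plus C class scores."""
    return (S, S, B * 5 + C)

def split_cell(cell: np.ndarray, B: int, C: int):
    """Split one cell's vector into (B, 5) box rows and a (C,) class vector."""
    boxes = cell[: B * 5].reshape(B, 5)  # each row: x, y, w, h, confidence
    classes = cell[B * 5 : B * 5 + C]    # one score per class, shared by the cell
    return boxes, classes

S, B, C = 7, 2, 20                       # example values, not a requirement
shape = yolo_output_shape(S, B, C)
pred = np.zeros(shape, dtype=np.float32) # stand-in for a real network output
boxes, classes = split_cell(pred[3, 4], B, C)
print(shape, boxes.shape, classes.shape)
```

Decoding is cheap because it is only slicing and reshaping: no per-region network call is involved.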
How YOLO Works: Grid Cells, Bounding Boxes, and Class Probabilities
At the core of YOLO is a uniform grid of size S×S overlaying the input image. Each grid cell predicts a fixed number of bounding boxes, each with a confidence score indicating how likely the box contains an object, along with the box's coordinates (tx, ty, tw, th) relative to the cell. Additionally, each cell predicts a vector of C class probabilities (softmax across classes). During inference, the model outputs a tensor of shape S×S×(B×5+C).
The bounding box coordinates are encoded relative to the grid cell:
- tx, ty are offsets from the top-left corner of the cell (sigmoid to keep them within [0,1]).
- tw, th are log-space scaling factors relative to anchor box dimensions.
- Confidence = P(Object) * IoU(pred, truth), which quantifies both presence and box accuracy.
Class probabilities are independent per grid cell, meaning each cell assigns a probability distribution over classes regardless of which bounding box is responsible. This means the final detection for a cell is a combination of the best bounding box (highest confidence) and the cell's class prediction.
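The encoding above can be sketched as a small decoding function. This follows the anchor-based formulation used in YOLOv2/v3-style heads and is an illustration, not the code of any particular release; `decode_box` and its argument names are hypothetical, and anchors are assumed to be expressed as fractions of the image size.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, col, row, anchor_w, anchor_h, S):
    """Turn raw outputs for the cell at (col, row) into a normalized (cx, cy, w, h) box."""
    cx = (col + sigmoid(tx)) / S   # sigmoid keeps the center inside its own cell
    cy = (row + sigmoid(ty)) / S
    w = anchor_w * math.exp(tw)    # log-space scaling of the anchor width
    h = anchor_h * math.exp(th)    # log-space scaling of the anchor height
    return cx, cy, w, h

# All-zero outputs place the box at the cell center with the anchor's own size.
cx, cy, w, h = decode_box(0.0, 0.0, 0.0, 0.0, col=3, row=4,
                          anchor_w=0.2, anchor_h=0.4, S=8)
print(cx, cy, w, h)
```

Because tw = th = 0 reproduces the anchor exactly, the network only has to learn small corrections around a sensible prior.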
Anchor Boxes: Why They Exist and What Goes Wrong Without Them
YOLO uses predefined anchor boxes (also called prior boxes) to help the model predict bounding box dimensions. Instead of predicting absolute width and height, the model predicts scaling factors (tw, th) relative to an anchor. This is critical because direct prediction of arbitrary box shapes leads to unstable gradients early in training — the model has to learn from scratch that boxes come in common aspect ratios (e.g., human: tall and thin, car: wide and short).
Anchor boxes are typically chosen by running k-means clustering on the training dataset's ground truth bounding box dimensions. YOLOv5 checks the anchors against the data before training and recomputes them when they fit poorly; YOLOv8 uses an anchor-free head and skips this step. For anchor-based versions, the number of anchors per grid cell is usually 3 or 5.
During inference, each predicted bounding box is the anchor box scaled by the model's output. The final box is represented as (center_x, center_y, width, height) relative to the grid cell.
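As a rough illustration of the clustering step, the sketch below runs k-means on (width, height) pairs and assigns each box to the anchor with the highest IoU, a common choice for anchor selection. It is a simplified stand-in, not the autoanchor routine of any YOLO release; the function names and the toy data are assumptions made for this example.

```python
import numpy as np

def wh_iou(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) box sizes and (K, 2) anchor sizes, assuming shared centers."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_a = (anchors[:, 0] * anchors[:, 1])[None, :]
    return inter / (area_b + area_a - inter)

def kmeans_anchors(boxes: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Cluster box sizes into k anchors; returns anchors sorted by area."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, anchors), axis=1)  # best-matching anchor per box
        new = anchors.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):                 # leave an empty cluster where it is
                new[j] = members.mean(axis=0)
        if np.allclose(new, anchors):        # converged
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

# Toy data: ten tall boxes (e.g. pedestrians) and ten wide boxes (e.g. cars), in pixels.
toy = np.array([[20, 60]] * 10 + [[80, 40]] * 10, dtype=float)
anchors = kmeans_anchors(toy, k=2)
print(anchors)
```

On real data the clusters are far less clean, so it is worth checking how well the resulting anchors cover the ground truth before training.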
Loss Function: What YOLO Actually Minimizes
YOLO's loss function is a multi-part objective that balances localization, confidence, and classification. The original YOLO paper used sum-squared error, but modern versions (YOLOv3+) use a combination of:
- Localization loss: measures the error in bounding box coordinates. Typically CIoU or GIoU loss; CIoU captures overlap, center distance, and aspect ratio.
- Confidence loss: binary cross-entropy (BCE) for whether an object exists in the box. Positive samples are predicted boxes that match a ground truth (highest IoU); negative samples are those with low IoU.
- Classification loss: BCE for each class (multi-label), so the model can predict multiple classes per box.
The loss is weighted to prioritize localization and classification over confidence. Modern YOLO implementations assign one positive anchor per ground truth (based on IoU threshold) and ignore anchors with intermediate IoU to reduce noise.
Class imbalance is handled using focal loss-like weighting: the confidence loss down-weights easy negatives.
- Box loss punishes misaligned boxes — it's the most weighted part of the loss.
- Confidence loss asks: 'Are you sure an object is here?' — easy negatives are down-weighted.
- Classification loss treats each class independently (multi-label), so a single box can carry more than one label when classes overlap.
- The weight ratios (e.g., box:conf:cls = 0.05:1.0:0.5 in YOLOv5's default hyperparameters) are critical and dataset-specific.
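To show what an IoU-based box loss measures, here is a small sketch computing IoU and GIoU for two corner-format boxes, plus a weighted sum of the three loss terms using the 0.05:1.0:0.5 ratio quoted above. GIoU is used because it is the simplest of the IoU-family losses; CIoU adds center-distance and aspect-ratio terms that are omitted here. All function names are hypothetical, and boxes are assumed to have positive area.

```python
def box_area(b):
    """Area of an (x1, y1, x2, y2) box."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    # Smallest box that encloses both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = iou - (enclose - union) / enclose
    return iou, giou

def giou_loss(a, b):
    """0 for a perfect match, up to 2 for distant boxes."""
    return 1.0 - iou_and_giou(a, b)[1]

def total_loss(box_loss, conf_loss, cls_loss, w_box=0.05, w_conf=1.0, w_cls=0.5):
    """Weighted sum of the three terms; the default weights are the ratio from the text."""
    return w_box * box_loss + w_conf * conf_loss + w_cls * cls_loss

iou, giou = iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3))
print(iou, giou, giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, GIoU stays informative when boxes do not overlap at all: two disjoint boxes have IoU 0 regardless of distance, while their GIoU keeps decreasing as they move apart.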
How to Read mAP: Precision-Recall Curves and IoU Thresholds
Mean Average Precision (mAP) is the de facto metric for object detection. It summarizes the precision-recall trade-off across all classes and IoU thresholds. Understanding how to read mAP is essential for debugging model performance and making deployment decisions.
Precision-Recall Curves: For each class, the model's detections are ranked by confidence. As you lower the confidence threshold, more detections are considered, increasing recall but potentially decreasing precision. The precision-recall curve plots precision at each recall level. Average Precision (AP) is the area under this curve (AUC). mAP is the mean of AP across all classes.
IoU Thresholds: A detection is considered a true positive only if its Intersection over Union (IoU) with a ground truth box exceeds a threshold. The COCO evaluation uses 10 IoU thresholds from 0.50 to 0.95 at 0.05 increments. mAP@0.5 uses IoU=0.5 (lenient), while mAP@0.5:0.95 averages across all thresholds (more stringent). For production, mAP@0.5 is often used for coarse localization, but mAP@0.75 or mAP@0.5:0.95 better reflect precise box fitting.
How to interpret: A higher mAP@0.5:0.95 indicates the model can both detect objects and draw tight boxes. If mAP@0.5 is high but mAP@0.5:0.95 is low, the model detects objects but boxes are poorly aligned — a common sign of anchor mismatch or resolution issues.
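The AP computation described above can be sketched for one class at one IoU threshold. This version uses all-point interpolation of the precision-recall curve; COCO's official tooling samples the curve at fixed recall points instead, so treat the numbers as illustrative rather than COCO-exact. The function name and inputs are assumptions: `is_tp` marks whether each detection matched a ground truth at the chosen IoU threshold.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class at one IoU threshold (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve, then make precision non-increasing from right to left.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Area under the stepwise curve.
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# Three detections ranked by score: hit, miss, hit; two ground-truth objects.
ap = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
print(round(ap, 4))

# The ten IoU thresholds behind mAP@0.5:0.95.
thresholds = np.linspace(0.5, 0.95, 10)
print(thresholds)
```

mAP@0.5:0.95 would repeat this per class and per threshold (re-deriving `is_tp` each time) and average the results.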
YOLO Implementation with Keras/TensorFlow
While the Ultralytics ecosystem is PyTorch-based, you can run YOLOv8 inference using TensorFlow via the KerasCV library or by exporting models to TensorFlow SavedModel format. This is useful for teams that standardise on TensorFlow Serving, TFLite on mobile, or TF.js in the browser.
KerasCV YOLOv8: The keras_cv package provides a YOLOV8Detector model pre-trained on COCO. It uses the same architecture as Ultralytics but is implemented in pure Keras.
Export from PyTorch: Alternatively, export a trained Ultralytics model to TensorFlow via ONNX and then to TF. The ultralytics library supports export(format='saved_model') to generate a TensorFlow SavedModel directly.
TFLite on Edge: For edge deployment, convert the TensorFlow model to TFLite (FP16 or INT8) to run on ARM CPUs or Edge TPU.
Call model.export(format='saved_model') to get a TensorFlow model. Then load it with tf.saved_model.load() and run inference with model(inputs).
Non-Maximum Suppression (NMS): Cleaning Up Overlapping Detections
Because multiple grid cells and anchor boxes can predict the same object, the model often outputs many duplicate bounding boxes around a single object. Non-Maximum Suppression (NMS) is the post-processing step that consolidates these into one detection per object.
NMS works by:
1. Sorting all detections by confidence score (highest first).
2. Selecting the highest-confidence detection.
3. For each remaining detection, computing Intersection over Union (IoU) with the selected box.
4. If IoU > threshold (typically 0.5), suppressing (removing) the overlapping detection.
5. Repeating until no more detections remain.
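The steps above can be sketched in NumPy as follows. This is a plain single-class greedy NMS for illustration, not the implementation shipped with any YOLO release; boxes are assumed to be in (x1, y1, x2, y2) corner format with positive area.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)              # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]                         # step 2: take the best remaining box
        keep.append(int(i))
        rest = order[1:]
        # Step 3: IoU between the selected box and every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]      # step 4: drop heavy overlaps, then repeat
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_thresh=0.5)
print(kept)
```

The first two boxes overlap with IoU of about 0.68, so the lower-scoring one is suppressed, while the distant third box survives. In a multi-class detector this is usually run per class.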
This greedy algorithm is simple but O(n²) in the number of candidate boxes, making it a bottleneck at high frame rates. Variants like Soft-NMS (reduce confidence instead of removing) or Fast NMS (vectorized) are used in practice.
Modern YOLO implementations (YOLOv5, YOLOv8) include NMS within the model pipeline; you can control it via the IoU and confidence thresholds (iou_thres and conf_thres in YOLOv5, iou and conf in the Ultralytics YOLOv8 API).
Filter candidates by conf_thres early, or use a batched NMS implementation in TensorRT.
YOLOv8 Architecture and Key Improvements Over Earlier Versions
YOLOv8 (released by Ultralytics in 2023) introduced several architectural improvements over YOLOv5:
- Anchor-free detection head: The head predicts box geometry directly at each grid location instead of offsets from predefined anchors, simplifying the output.
- Decoupled head: Separate branches for classification and regression, improving convergence.
- CSPDarknet backbone with improved cross-stage partial connections (C2f modules).
- CIoU loss for localization, plus Distribution Focal Loss (DFL) for box regression.
- Data augmentation: Mosaic, copy-paste, MixUp, etc. Training uses the same pipeline, but hyperparameters are tuned.
The model family includes Nano, Small, Medium, Large, and X-Large variants, targeting different latency/accuracy trade-offs. YOLOv8-X achieves ~53.7 mAP on COCO while running at 35 FPS on a V100.
YOLO Genealogy: Architecture vs Speed vs Accuracy
YOLO has evolved rapidly since its inception. Understanding the genealogy helps you choose the right version for your deployment constraints. Below is a comparison of major YOLO versions, focusing on architecture, speed, and accuracy.
| Version | Year | Backbone | Detection Head | Loss | COCO mAP@0.5:0.95 | Latency T4 FP16 (ms) |
|---|---|---|---|---|---|---|
| YOLOv5 | 2020 | CSPDarknet | Anchor-based, coupled | CIoU + BCE | 50.2 (v5x) | 2.1 (v5n) |
| YOLOv6 | 2022 | EfficientRep | Anchor-based, decoupled | Varifocal Loss | 52.5 (v6L) | 1.5 (v6n) |
| YOLOv7 | 2022 | ELAN | Anchor-based | CIoU + BCE | 56.8 (v7x) | 2.8 (v7x) |
| YOLOv8 | 2023 | CSPDarknet (C2f) | Anchor-free, decoupled | CIoU + DFL | 53.7 (v8x) | 1.8 (v8n) |
| YOLOv9 | 2024 | CSPDarknet + PGI | Anchor-free with GELAN | CIoU + DFL + activation | 55.1 (v9x) | 2.4 (v9n) |
| YOLOv10 | 2024 | CSPDarknet + NMS-free | NMS-free dual assignment | CIoU + DFL | 54.2 (v10x) | 0.8 (v10n) |
| YOLOv11 | 2025 | CSPDarknet + task-specific | Anchor-free + DFL | CIoU + DFL + distillation | 56.0 (v11x) | 1.2 (v11n) |
Note: Latency and mAP numbers are from official repositories on NVIDIA T4 GPU with FP16; actual values vary with batch size and environment.
Key Trends: Newer versions generally improve the accuracy/speed trade-off, but the gains for large models (x variants) are smaller and not strictly monotonic. For edge deployment, YOLOv5n and YOLOv8n remain competitive due to their small footprint. YOLOv10 introduces NMS-free inference, which removes the NMS post-processing step and its latency cost.
Production Gotchas and Deployment Best Practices
Deploying YOLO in production goes beyond training a model on COCO. Here are the most common pitfalls:
- Input resolution mismatch: Training at 640×640 but running inference at 416×416 changes the anchor box scales and degrades accuracy. Always match resolutions or recalibrate anchors.
- Batch normalization in inference: If you export to ONNX/TensorRT, ensure batch normalization layers are fused. Misconfiguration leads to an accuracy drop in the exported model.
- Preprocessing pipeline: Many production systems resize images by letterboxing (adding black bars to maintain aspect ratio). Forgetting to exclude the padded pixels from detection can cause false positives on borders.
- NMS as bottleneck: As mentioned, NMS is O(N²). Use batched NMS or TensorRT's NMS plugin for real-time systems.
- Model quantization: Converting to FP16 or INT8 can drop accuracy, especially for small objects. Test thoroughly before deploying.
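For the letterboxing pitfall, the geometry is worth spelling out, since boxes predicted on the padded image must be mapped back before they are compared with the original frame. The sketch below computes the scale and padding and inverts them; the function names are hypothetical, a square target size and symmetric padding are assumed, and the resize itself is left to whatever image library you use.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale and padding that fit a src_w x src_h image into a dst x dst square."""
    scale = min(dst / src_w, dst / src_h)  # shrink or grow without distorting aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) / 2              # padding on the left (same on the right)
    pad_y = (dst - new_h) / 2              # padding on the top (same on the bottom)
    return scale, pad_x, pad_y

def to_original(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from letterboxed coordinates back to the source image."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)

# A 1280x720 frame letterboxed to 640x640: half scale, 140 px bars top and bottom.
scale, pad_x, pad_y = letterbox_params(1280, 720, dst=640)
restored = to_original((100, 200, 300, 400), scale, pad_x, pad_y)
print(scale, pad_x, pad_y, restored)
```

Any detection whose center falls inside the padded bars (here, y below 140 or above 500 in letterboxed coordinates) lies outside the real image and can be discarded.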
Edge Deployment Benchmarks: YOLO on Embedded Devices
Deploying YOLO on edge devices (Jetson, Raspberry Pi, Coral, smartphone) requires balancing model size, accuracy, and power consumption. Below are typical benchmarks for popular edge hardware using YOLOv8 and YOLOv10 variants.
| Device | Model | Precision | Input Size | FPS | mAP@0.5 | Power (W) |
|---|---|---|---|---|---|---|
| Jetson Orin NX 16GB | YOLOv8n | FP16 | 640×640 | 180 | 44.5 | 10-25 |
| Jetson Orin NX 16GB | YOLOv8s | FP16 | 640×640 | 120 | 50.2 | 10-25 |
| Jetson Nano 4GB | YOLOv5n | FP16 | 320×320 | 40 | 33.0 | 5-10 |
| Raspberry Pi 5 | YOLOv8n | INT8 TFLite | 320×320 | 12 | 30.1 | 3-5 |
| Coral Edge TPU | YOLOv8n | INT8 TFLite | 320×320 | 30 | 28.5 | 2-4 |
| iPhone 15 Pro | YOLOv8n | CoreML FP16 | 640×640 | 45 | 44.0 | ~3 |
Note: FPS measured with batch size 1, post-processing included. mAP on COCO val2017. Power estimates at typical load.
- Jetson Orin NX delivers desktop-level performance for embedded use.
- Raspberry Pi is suitable for low-throughput applications (e.g., sporadic object detection).
- Coral Edge TPU offers excellent efficiency for its power budget but requires INT8 quantization, which degrades mAP.
- Mobile devices with CoreML or GPU delegates achieve good performance for real-time mobile apps.
False Positives at Scale: When YOLO Sees Objects That Aren't There
- Inference resolution must match training resolution exactly, or re-tune anchors.
- Training without negative examples leaks bias: the model learns to always detect something.
- Validate with a held-out set of typical production scenes before deployment.
Key takeaways
Common mistakes to avoid
- Using default COCO anchor boxes without recomputing for custom data
- Training with mosaic augmentation enabled for the entire training run. Turn it off for the final epochs by setting mosaic=0.0 or using a schedule that turns it off. YOLOv8's close_mosaic parameter automates this.
- Deploying with mismatched inference resolution
- Not tuning NMS IoU threshold for scene density
- Forgetting to apply NMS at all during inference. Ultralytics applies NMS inside the model() call; if you export to ONNX, you may need to integrate an NMS layer manually.
Interview Questions on This Topic
Explain the concept of anchor boxes in YOLO. Why are they needed, and how do they affect training stability?
Frequently Asked Questions