Mid-level 10 min · March 06, 2026

YOLO Object Detection — False Positives from Mismatch

YOLO false positives >0.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • YOLO reframes object detection as a single regression problem: one forward pass predicts bounding boxes and class probabilities simultaneously.
  • Grid cells divide the image into an S×S grid; each cell predicts B bounding boxes and C class probabilities.
  • Anchor boxes encode prior shapes to stabilize training — without them, box predictions drift.
  • Non-Maximum Suppression (NMS) eliminates duplicate detections by IoU threshold, typically 0.5.
  • Real-time performance: YOLOv8 runs at 100+ FPS on an NVIDIA T4 GPU, making it suitable for video.
  • Production gotcha: NMS becomes the bottleneck at high frame rates — optimize with TensorRT or batch NMS.
Plain-English First

Imagine you're a security guard watching a parking lot on a single TV screen. An old-school guard looks at every corner of the lot one piece at a time before calling anything suspicious — that takes ages. YOLO is the guard who glances at the whole screen once and instantly shouts 'there's a red car near gate 3, a person at gate 7, and a bike by the fence' — all in a single look. That's the entire secret: one forward pass through a neural network, and every object in the image is labelled and boxed simultaneously.

Every time your phone unlocks with your face, a Tesla decides not to brake for a shadow, or a warehouse robot grabs the right box off a conveyor belt, an object detector is running in the background. The demand for detectors that are both accurate and fast enough to run in real time has never been higher — and that tension between accuracy and speed is exactly where YOLO was born.

Before YOLO (You Only Look Once), the dominant paradigm was two-stage detection: a region-proposal network first suggests thousands of bounding-box candidates, then a separate classifier scores each one. Models like R-CNN and Faster R-CNN achieved excellent mean Average Precision (mAP), but their pipeline was fundamentally serial. On a 2015 GPU, Faster R-CNN ran at roughly 7 frames per second — nowhere near the 30+ fps required for real-time video. YOLO reframed detection as a single regression problem, collapsing both stages into one convolutional network pass and hitting 45 fps on the same hardware.

By the end of this article you'll understand exactly how YOLO divides an image into a grid, predicts bounding boxes and class probabilities simultaneously, why anchor boxes exist and what goes wrong without them, how Non-Maximum Suppression cleans up overlapping detections, and what the loss function is actually penalising. You'll also run a complete YOLOv8 inference and fine-tuning pipeline, and walk away knowing the production gotchas that trip up even experienced ML engineers.

What is Object Detection — YOLO?

Object detection is the computer vision task of locating and classifying objects within an image. YOLO (You Only Look Once) treats it as a single regression problem directly from image pixels to bounding box coordinates and class probabilities. Unlike sliding-window or region-proposal approaches, YOLO divides the input image into an S×S grid. Each grid cell is responsible for predicting B bounding boxes and C class probabilities — all in one forward pass of a convolutional neural network.
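To make the grid idea concrete, here is a minimal sketch that computes the output tensor shape and finds which cell is responsible for an object. The values S=7, B=2, C=20 are the original YOLOv1 settings on PASCAL VOC; the function names are illustrative and not part of any library.

```python
def output_shape(S, B, C):
    """Shape of a YOLOv1-style output tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the box centre (in pixels)."""
    col = min(int(cx / img_w * S), S - 1)  # clamp so a centre on the far edge stays in-grid
    row = min(int(cy / img_h * S), S - 1)
    return row, col

print(output_shape(7, 2, 20))                    # (7, 7, 30)
print(responsible_cell(420, 190, 448, 448, 7))   # (2, 6)
```

Each box contributes 5 numbers (x, y, w, h, confidence), which is where the B×5 term comes from.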

Why does this matter? Because inference cost no longer depends on how many candidate regions an image produces. Two-stage detectors first generate thousands of region proposals, then classify each one, which is slower. YOLO eliminates the region proposal step entirely, achieving real-time inference on consumer hardware.

yolo_inference.py (Python)
# TheCodeForge — Real inference with YOLOv8
# Ensure ultralytics is installed: pip install ultralytics
from ultralytics import YOLO
import cv2

# Load a pretrained YOLOv8n model (nano: smallest, fastest)
model = YOLO('yolov8n.pt')

# Run inference on an image
results = model('parking_lot.jpg')

# Extract detections: (x1, y1, x2, y2, confidence, class_id)
detections = results[0].boxes.data.cpu().numpy()

# Loop over detections and print
for *box, conf, cls in detections:
    x1, y1, x2, y2 = [int(v) for v in box]
    label = results[0].names[int(cls)]
    print(f'{label}: ({x1},{y1},{x2},{y2}) conf={conf:.2f}')

# Visualize and save
annotated = results[0].plot()
cv2.imwrite('output.jpg', annotated)
Output
car: (320,100,520,280) conf=0.93
person: (80,200,150,400) conf=0.87
bicycle: (600,300,680,450) conf=0.72
YOLO's Secret: One Look, All Answers
  • Instead of asking 'where are objects?' then separately asking 'what are they?', YOLO asks both at once.
  • Grid cells are like a coarse heatmap — each cell owns a small region and predicts the objects inside it.
  • The output tensor has shape S × S × (B × 5 + C) — compact and fast to decode.
  • It's regression, not classification: the model learns to output coordinates and probabilities directly.
Production Insight
When deploying YOLO on edge devices, the grid resolution S directly impacts both accuracy and inference speed.
A smaller S (a coarser grid, e.g., 13×13) may miss small or tightly packed objects; a larger S (a finer grid, e.g., 19×19) recovers them at the cost of more predictions to compute and filter.
Rule: choose S based on the smallest object size you need to detect — your grid cell width should be no larger than half the smallest object's width.
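The rule above can be turned into a quick calculation. This is a sketch only: min_grid_size is a hypothetical helper, and it assumes a square input and a uniform grid.

```python
import math

def min_grid_size(img_size, smallest_obj_px):
    """Smallest S such that the cell width (img_size / S) is at most half the object width."""
    max_cell_px = smallest_obj_px / 2
    return math.ceil(img_size / max_cell_px)

print(min_grid_size(640, 64))  # 20 -> cells are 32 px wide
print(min_grid_size(640, 32))  # 40 -> cells are 16 px wide
```

Halving the smallest object size doubles the required S, so the number of cells grows fourfold.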
Key Takeaway
YOLO is a Convolutional Neural Network that performs detection in one shot.
It divides the image into a grid and predicts multiple boxes per cell.
No region proposals, no separate classifier — speed comes from simplicity.
Choosing Between One-Stage vs Two-Stage Detectors
If: Need real-time inference (>=30 FPS) on CPU or low-end GPU
Use: YOLO (one-stage). Sacrifice some mAP for speed.
If: Need highest possible mAP and speed is secondary
Use: Faster R-CNN (two-stage). Slower but more accurate for small objects.
If: Running on embedded device (Jetson, Raspberry Pi) with limited memory
Use: YOLOv8-nano or YOLOv5-nano. One-stage models are more memory-efficient.
If: Detecting extremely small objects in a high-resolution image (e.g., satellites)
Use: Consider two-stage or anchor-free detectors (FCOS, RepPoints), since YOLO's grid might miss them.

How YOLO Works: Grid Cells, Bounding Boxes, and Class Probabilities

At the core of YOLO is a uniform grid of size S×S overlaying the input image. Each grid cell predicts a fixed number of bounding boxes, each with a confidence score indicating how likely the box contains an object, along with the box's coordinates (tx, ty, tw, th) relative to the cell. Additionally, each cell predicts a vector of C class probabilities (softmax across classes). During inference, the model outputs a tensor of shape S×S×(B×5+C).

The bounding box coordinates are encoded relative to the grid cell:
  • tx, ty are offsets from the top-left corner of the cell (sigmoid to keep them within [0,1])
  • tw, th are log-space scaling factors relative to anchor box dimensions
  • Confidence = P(Object) * IoU(pred, truth), which quantifies both presence and box accuracy.

Class probabilities are independent per grid cell, meaning each cell assigns a probability distribution over classes regardless of which bounding box is responsible. This means the final detection for a cell is a combination of the best bounding box (highest confidence) and the cell's class prediction.
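A small sketch of how the two predictions combine, assuming the YOLOv1-style rule that the class-specific score is box confidence times conditional class probability. The helper names are illustrative.

```python
def class_scores(box_conf, class_probs):
    """Class-specific score = box confidence x conditional class probability."""
    return [box_conf * p for p in class_probs]

def best_detection(boxes_conf, class_probs):
    """Pick the highest-confidence box in a cell and its most likely class."""
    best_box = max(range(len(boxes_conf)), key=lambda i: boxes_conf[i])
    scores = class_scores(boxes_conf[best_box], class_probs)
    best_cls = max(range(len(scores)), key=lambda c: scores[c])
    return best_box, best_cls, scores[best_cls]

# One cell, two boxes, three classes (made-up numbers)
box_idx, cls_idx, score = best_detection([0.30, 0.80], [0.10, 0.70, 0.20])
print(box_idx, cls_idx, round(score, 2))  # 1 1 0.56
```

Note that a box confidence of 0.80 becomes a final score of only 0.56 once the class probability is applied, which is why confidence thresholds are usually applied to the product.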

yolo_head_decoding.py (Python)
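# TheCodeForge: manual decoding of an anchor-based (YOLOv2/v3-style) head
# Note: YOLOv8 itself is anchor-free; this illustrates the anchor-based encoding described above
import torch

def decode_yolo_output(pred, anchors, num_classes=80):
    # pred: (batch_size, grid_h, grid_w, num_anchors, 4 + 1 + num_classes)
    # anchors: (num_anchors, 2), expressed in grid-cell units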
    batch_size, grid_h, grid_w, num_anchors, _ = pred.shape
    # Create grid coordinates
    grid_y, grid_x = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing='ij')
    # Decode center offsets
    pred_xy = torch.sigmoid(pred[..., 0:2])  # tx, ty -> cell-relative
    pred_xy = pred_xy + torch.stack([grid_x.float(), grid_y.float()], dim=-1).unsqueeze(-2)
    pred_xy = pred_xy / torch.tensor([grid_w, grid_h])  # normalize to [0,1]
    # Decode box size using anchors (tw, th -> width/height scaling)
    pred_wh = torch.exp(pred[..., 2:4]) * anchors.unsqueeze(0).unsqueeze(0)
    pred_wh = pred_wh / torch.tensor([grid_w, grid_h])
    # Confidence and class scores
    conf = torch.sigmoid(pred[..., 4:5])
    class_scores = torch.softmax(pred[..., 5:5+num_classes], dim=-1)
    return torch.cat([pred_xy, pred_wh, conf, class_scores], dim=-1)
Output
Decoded tensor shape: (batch_size, grid_h, grid_w, num_anchors, 5+num_classes)
Coordinate Encoding Gotcha
The decoded coordinates returned by the function are normalized to [0,1] relative to the whole image, because the cell-relative offsets are added to the cell index and then divided by the grid size. To get pixel coordinates, multiply x and width by the image width, and y and height by the image height. A common bug is forgetting the sigmoid on center offsets; without it, boxes can shift outside their cell, causing training instability.
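Because decode_yolo_output divides by the grid dimensions, its boxes are fractions of the image size. A minimal sketch of the conversion to pixel corners follows; the helper name is hypothetical.

```python
def cxcywh_norm_to_xyxy_px(cx, cy, w, h, img_w, img_h):
    """Convert a normalized (centre x, centre y, width, height) box to pixel corners."""
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2

print(cxcywh_norm_to_xyxy_px(0.5, 0.5, 0.25, 0.5, 640, 480))
# (240.0, 120.0, 400.0, 360.0)
```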
Production Insight
Grid resolution S controls the trade-off between recall and model size.
A common production mistake is using the same grid as the author's model without considering the dataset's object size distribution.
Rule: if your dataset has many small objects (e.g., <32×32 in a 640×640 image), increase S (e.g., from 13×13 to 19×19) or use a model with a smaller output stride, which yields a finer grid.
Key Takeaway
Each grid cell predicts multiple boxes with class probabilities.
Coordinates are encoded relative to the cell to stabilize training.
Confidence combines object presence and box fit — a high confidence doesn't guarantee high class score.

Anchor Boxes: Why They Exist and What Goes Wrong Without Them

YOLO uses predefined anchor boxes (also called prior boxes) to help the model predict bounding box dimensions. Instead of predicting absolute width and height, the model predicts scaling factors (tw, th) relative to an anchor. This is critical because direct prediction of arbitrary box shapes leads to unstable gradients early in training — the model has to learn from scratch that boxes come in common aspect ratios (e.g., human: tall and thin, car: wide and short).

Anchor boxes are typically chosen by running k-means clustering on the training dataset's ground truth bounding box dimensions. YOLOv5 checks its anchors against the training labels at the start of training and recomputes them if they fit poorly (AutoAnchor); YOLOv8 is anchor-free and skips this step. The number of anchors per grid cell is usually 3 or 5.

During inference, each predicted bounding box is the anchor box scaled by the model's output. The final box is represented as (center_x, center_y, width, height) relative to the grid cell.
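The sketch below illustrates both halves of the anchor mechanism: matching a ground-truth shape to its closest anchor by width/height IoU, and decoding (tw, th) back into a box size. The anchor values are made up for illustration, and the exponential decode follows the YOLOv2/v3 convention; other versions use different formulas.

```python
import math

def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same centre, given only (w, h)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(gt_wh, anchors):
    """Index of the anchor whose shape overlaps the ground truth most."""
    return max(range(len(anchors)), key=lambda i: wh_iou(gt_wh, anchors[i]))

def decode_wh(tw, th, anchor):
    """Predicted size = anchor size scaled by exp(t)."""
    return anchor[0] * math.exp(tw), anchor[1] * math.exp(th)

anchors = [(30, 60), (60, 30), (120, 120)]  # hypothetical anchors, in pixels
print(best_anchor((28, 70), anchors))       # 0 (the tall, thin anchor)
print(decode_wh(0.0, 0.0, anchors[0]))      # (30.0, 60.0): zero offsets reproduce the anchor
```

Because tw = th = 0 reproduces the anchor exactly, the network only has to learn small corrections when the anchors fit the data well.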

anchor_computation.py (Python)
# TheCodeForge — Compute custom anchors with k-means
import numpy as np
from sklearn.cluster import KMeans

def compute_anchors(labels_file, num_anchors=9, image_size=640):
    # labels_file: CSV with img_id, class, x_center_norm, y_center_norm, width_norm, height_norm
    boxes = []
    with open(labels_file) as f:
        for line in f:
            parts = line.strip().split(',')
            # Columns: img_id, class, x_center, y_center, width, height
            w = float(parts[4]) * image_size
            h = float(parts[5]) * image_size
            boxes.append([w, h])
    boxes = np.array(boxes)
    kmeans = KMeans(n_clusters=num_anchors, n_init=10, random_state=0).fit(boxes)
    anchors = kmeans.cluster_centers_
    # Sort by area descending for assignment to detection head scales
    anchors = anchors[np.argsort(anchors[:,0]*anchors[:,1])[::-1]]
    return anchors.tolist()

# Example: anchors = compute_anchors('train_labels.csv', num_anchors=3)
# print(anchors)  # [[w1,h1], [w2,h2], ...]
Output
[[462.3, 386.1], [344.2, 289.4], [128.5, 107.6]]
Anchor-Free Alternatives
Recent models like YOLOX and YOLOv8 are anchor-free: like FCOS, and like the original YOLOv1, they predict boxes directly instead of scaling a prior shape. Anchor-free heads often require more careful handling of scale mismatches and are more sensitive to training data distribution.
Production Insight
Using default anchors designed for COCO on a custom dataset is a common mistake.
The default anchors have aspect ratios for common objects (e.g., 1:1, 1:2, 2:1) but your data might have many elongated objects (e.g., forklifts).
Rule: always recompute anchors on your training labels before training. YOLOv5's auto-anchor function does this automatically — but only if you set the dataset path correctly; missing this step silently hurts mAP by 2-5%.
Key Takeaway
Anchors act as prior box shapes that the model adjusts.
Compute anchors on your own dataset using k-means for best accuracy.
Ignoring anchors leads to training instability and poor small-object detection.

Loss Function: What YOLO Actually Minimizes

YOLO's loss function is a multi-part objective that balances localization, confidence, and classification. The original YOLO paper used sum-squared error, but modern versions (YOLOv3+) use a combination of:
  • Localization loss: measures the error in bounding box coordinates. Typically an IoU-based loss such as GIoU or CIoU; CIoU accounts for overlap, center distance, and aspect ratio.
  • Confidence loss: binary cross-entropy (BCE) on whether an object exists in the box. Positive samples are predicted boxes matched to a ground truth (highest IoU); negative samples are those with low IoU.
  • Classification loss: BCE applied to each class independently (multi-label), so the model can predict multiple classes per box.

The loss is weighted to prioritize localization and classification over confidence. Modern YOLO implementations assign one positive anchor per ground truth (based on IoU threshold) and ignore anchors with intermediate IoU to reduce noise.

Class imbalance is handled using focal loss-like weighting: the confidence loss down-weights easy negatives.
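To show why IoU-based losses replaced plain coordinate error, here is a minimal pure-Python sketch of IoU and GIoU for two boxes. It is an illustration of the published GIoU definition, not the code any YOLO release uses; the corresponding loss is 1 - GIoU.

```python
def iou_and_giou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns (IoU, GIoU)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box that encloses both a and b
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose
    return iou, giou

print(iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3)))  # IoU = 1/7, GIoU = 1/7 - 2/9
print(iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1)))  # IoU = 0,   GIoU = -1/3
```

In the second case the boxes do not overlap, so IoU is 0 regardless of how far apart they are. GIoU is still informative (it rises as the boxes move closer), which gives the optimizer a useful gradient early in training.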

yolo_loss.py (Python)
# TheCodeForge — Simplified YOLO loss (for illustration)
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss  # CIoU loss (torchvision >= 0.13)

def yolo_loss(pred_boxes, target_boxes, pred_conf, pred_cls, target_cls,
              obj_mask, noobj_mask, loss_weights):
    # pred_boxes, target_boxes: (N,4) in (x1,y1,x2,y2) format
    # pred_conf: (N,) objectness logits
    # pred_cls: (N,C) class logits; target_cls: (N,C) float 0/1 targets
    # obj_mask, noobj_mask: (N,) booleans
    # CIoU loss for positive samples (assumes at least one positive in the batch)
    iou_loss = complete_box_iou_loss(pred_boxes[obj_mask], target_boxes[obj_mask], reduction='mean')
    # Confidence loss: BCE pushes positives toward 1 and negatives toward 0
    conf_loss = F.binary_cross_entropy_with_logits(pred_conf[obj_mask], torch.ones_like(pred_conf[obj_mask]))
    conf_loss += loss_weights['noobj'] * F.binary_cross_entropy_with_logits(pred_conf[noobj_mask], torch.zeros_like(pred_conf[noobj_mask]))
    # Classification loss: BCE for each positive sample
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls[obj_mask], target_cls[obj_mask])
    return loss_weights['box'] * iou_loss + loss_weights['conf'] * conf_loss + loss_weights['cls'] * cls_loss
Output
loss: tensor(5.4321)
Loss as a Three-Part Balancing Act
  • Box loss punishes misaligned boxes. In the original YOLO it carries the largest weight (λ_coord = 5, against λ_noobj = 0.5).
  • Confidence loss asks: 'Are you sure an object is here?' — easy negatives are down-weighted.
  • Classification loss treats each class independently (multi-label), so a box can carry more than one class label.
  • The weight ratios (e.g., box:obj:cls = 0.05:1.0:0.5 in YOLOv5's default hyperparameters) are critical and dataset-specific.
Production Insight
Misbalancing loss weights can degrade performance significantly.
If you see many false positives, increase the noobj confidence loss weight or lower the objectness threshold.
If boxes are consistently off, increase the localization loss weight.
Rule: run a hyperparameter search on loss weights when adapting YOLO to a new dataset — default COCO weights rarely transfer perfectly.
Key Takeaway
YOLO's loss is a weighted sum of localization, confidence, and classification losses.
Localization uses CIoU/GIoU, not simple L2, to account for overlap.
Class imbalance in confidence loss is handled by weighting negative samples.

How to Read mAP: Precision-Recall Curves and IoU Thresholds

Mean Average Precision (mAP) is the de facto metric for object detection. It summarizes the precision-recall trade-off across all classes and IoU thresholds. Understanding how to read mAP is essential for debugging model performance and making deployment decisions.

Precision-Recall Curves: For each class, the model's detections are ranked by confidence. As you lower the confidence threshold, more detections are considered, increasing recall but potentially decreasing precision. The precision-recall curve plots precision at each recall level. Average Precision (AP) is the area under this curve (AUC). mAP is the mean of AP across all classes.

IoU Thresholds: A detection is considered a true positive only if its Intersection over Union (IoU) with a ground truth box exceeds a threshold. The COCO evaluation uses 10 IoU thresholds from 0.50 to 0.95 at 0.05 increments. mAP@0.5 uses IoU=0.5 (lenient), while mAP@0.5:0.95 averages across all thresholds (more stringent). For production, mAP@0.5 is often used for coarse localization, but mAP@0.75 or mAP@0.5:0.95 better reflect precise box fitting.

How to interpret: A higher mAP@0.5:0.95 indicates the model can both detect objects and draw tight boxes. If mAP@0.5 is high but mAP@0.5:0.95 is low, the model detects objects but boxes are poorly aligned — a common sign of anchor mismatch or resolution issues.
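The AP calculation described above can be sketched in a few lines. This version uses all-point interpolation (PASCAL VOC style); COCO's evaluator samples the curve at 101 recall points instead, so its numbers can differ slightly. The function name and inputs are illustrative.

```python
def average_precision(is_tp, num_gt):
    """AP for one class at one IoU threshold, using all-point interpolation.

    is_tp:  booleans for detections already sorted by descending confidence
            (True = matched a ground-truth box at the chosen IoU threshold).
    num_gt: number of ground-truth boxes for this class.
    """
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at recall r is the best precision at any recall >= r
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated curve
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Three detections ranked by confidence: hit, miss, hit; two ground-truth boxes
print(average_precision([True, False, True], num_gt=2))  # 0.8333... (= 5/6)
```

Re-running the same function with matches decided at IoU 0.75 instead of 0.5 flips some hits to misses, which is exactly how mAP@0.5 and mAP@0.75 diverge for a model with loose boxes.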

compute_map.py (Python)
# TheCodeForge — Compute mAP for YOLO predictions
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Suppose we have predictions and ground truths for one image
# preds: list of dicts with boxes, scores, labels
preds = [
    dict(
        boxes=torch.tensor([[100, 150, 200, 250], [300, 50, 400, 150]]),
        scores=torch.tensor([0.95, 0.80]),
        labels=torch.tensor([1, 2])
    )
]
target = [
    dict(
        boxes=torch.tensor([[95, 148, 205, 252], [295, 48, 405, 148]]),
        labels=torch.tensor([1, 2])
    )
]

metric = MeanAveragePrecision(iou_type="bbox")
metric.update(preds, target)
result = metric.compute()
print(f"mAP@0.5:0.95 = {result['map']:.4f}")
print(f"mAP@0.5 = {result['map_50']:.4f}")
print(f"mAP@0.75 = {result['map_75']:.4f}")
Output
mAP@0.5:0.95 = 0.8000
mAP@0.5 = 1.0000
mAP@0.75 = 1.0000
mAP Caveats
mAP is dataset-dependent. A model scoring 0.50 mAP on COCO might score 0.80 on a simpler dataset. Always compare models on your own validation set. Also, mAP does not reflect runtime performance — a high-mAP model may be too slow for real-time.
Production Insight
In production, choose the mAP threshold that matches your use case. For safety-critical applications (e.g., pedestrian detection), mAP@0.75 is more relevant than mAP@0.5. Monitor per-class AP: if one class has low AP, it may be underrepresented in training data or have high intra-class variation. Use the F1-confidence curve to select a confidence threshold that balances precision and recall for your deployment scenario.
Key Takeaway
mAP averages precision across recall levels and IoU thresholds. mAP@0.5:0.95 is the standard metric for tight localization. Use class-level AP to diagnose per-class failures.

YOLO Implementation with Keras/TensorFlow

While the Ultralytics ecosystem is PyTorch-based, you can run YOLOv8 inference using TensorFlow via the KerasCV library or by exporting models to TensorFlow SavedModel format. This is useful for teams that standardise on TensorFlow Serving, TFLite on mobile, or TF.js in the browser.

KerasCV YOLOv8: The keras_cv package provides a YOLOV8Detector model pre-trained on COCO. It uses the same architecture as Ultralytics but implemented in pure Keras. Below we show how to load and run inference.

Export from PyTorch: Alternatively, export a trained Ultralytics model to TensorFlow via ONNX and then to TF. The ultralytics library supports export(format='saved_model') to generate a TensorFlow SavedModel directly.

TFLite on Edge: For edge deployment, convert the TensorFlow model to TFLite (FP16 or INT8) to run on ARM CPUs or Edge TPU.

keras_yolo_inference.py (Python)
# TheCodeForge — YOLOv8 inference with KerasCV
# Install: pip install keras-cv tensorflow
import keras_cv
import tensorflow as tf
import numpy as np
from PIL import Image

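# Load a pre-trained YOLOV8 detector (backbone='csp_darknet', sizes: 'n','s','m','l','x').
# Preset names differ between KerasCV releases; confirm this one against the
# KerasCV documentation for your installed version before relying on it.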
model = keras_cv.models.YOLOV8Detector.from_preset(
    "yolo_v8_n_coco", bounding_box_format="xywh"
)

# Load and preprocess image
def preprocess_image(image_path, target_size=640):
    image = Image.open(image_path).convert('RGB')
    image = image.resize((target_size, target_size))
    image = np.array(image) / 255.0
    return tf.expand_dims(image, 0)

image_tensor = preprocess_image('test.jpg')

# Run inference
predictions = model.predict(image_tensor)
# predictions: {'boxes': (1,N,4), 'classes': (1,N), 'confidence': (1,N)}
# model.predict returns NumPy arrays, so np.asarray is used instead of .numpy()
boxes = np.asarray(predictions['boxes'][0])
scores = np.asarray(predictions['confidence'][0])
classes = np.asarray(predictions['classes'][0]).astype(int)

# Filter by confidence
conf_thres = 0.5
valid = scores >= conf_thres
print("Detections (xywh, score, class):")
for box, score, cls in zip(boxes[valid], scores[valid], classes[valid]):
    x, y, w, h = box
    print(f"  [{x:.0f}, {y:.0f}, {w:.0f}, {h:.0f}] conf={score:.2f} class={cls}")
Output
Detections (xywh, score, class):
[320, 100, 200, 180] conf=0.93 class=2
[80, 200, 70, 200] conf=0.87 class=0
TensorFlow SavedModel Export
If you have a trained PyTorch model from Ultralytics, run model.export(format='saved_model') to get a TensorFlow model. Then load it with tf.saved_model.load() and run inference with model(inputs).
Production Insight
Using KerasCV YOLOv8 directly in TensorFlow pipelines simplifies integration with TFX, TensorFlow Serving, and TFLite. However, be aware that KerasCV's implementation may have slight numerical differences from the Ultralytics version due to differing preprocessing and post-processing. Always validate on a representative sample before deploying. If using TensorFlow Serving, consider pre-processing outside the model to reduce GPU memory spikes.
Key Takeaway
YOLOv8 can be used in TensorFlow through KerasCV pre-trained models or by exporting from PyTorch. KerasCV is well-suited for teams already in the TensorFlow ecosystem.

Non-Maximum Suppression (NMS): Cleaning Up Overlapping Detections

Because multiple grid cells and anchor boxes can predict the same object, the model often outputs many duplicate bounding boxes around a single object. Non-Maximum Suppression (NMS) is the post-processing step that consolidates these into one detection per object.

NMS works by:
  1. Sorting all detections by confidence score (highest first).
  2. Selecting the highest-confidence detection.
  3. Computing the Intersection over Union (IoU) between the selected box and each remaining detection.
  4. Suppressing (removing) every remaining detection whose IoU exceeds the threshold (typically 0.5).
  5. Repeating from step 2 with the surviving detections until none remain.

This greedy algorithm is simple but O(n²) in the number of candidate boxes, making it a bottleneck at high frame rates. Variants like Soft-NMS (reduce confidence instead of removing) or Fast NMS (vectorized) are used in practice.

Modern YOLO implementations (YOLOv5, YOLOv8) include NMS within the model pipeline; you can control it via the IoU and confidence threshold parameters (iou_thres and conf_thres in YOLOv5's scripts, iou and conf in the Ultralytics YOLOv8 API).
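For comparison with hard suppression, here is a minimal pure-Python sketch of the Gaussian variant of Soft-NMS mentioned above. Instead of deleting overlapping boxes, it decays their scores by exp(-IoU² / sigma). The sigma and score_thresh defaults are illustrative choices, not values taken from any YOLO release.

```python
import math

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS. Returns [(index, decayed_score), ...] in selection order."""
    remaining = [[i, s] for i, s in enumerate(scores)]
    kept = []
    while remaining:
        best = max(remaining, key=lambda item: item[1])
        remaining.remove(best)
        kept.append((best[0], best[1]))
        # Decay the score of every other box according to its overlap with the winner
        for item in remaining:
            overlap = iou(boxes[best[0]], boxes[item[0]])
            item[1] *= math.exp(-(overlap ** 2) / sigma)
        remaining = [item for item in remaining if item[1] >= score_thresh]
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
for idx, score in soft_nms(boxes, scores):
    print(idx, round(score, 3))
```

Box 1 overlaps box 0 heavily, so it survives but drops to the bottom of the ranking with a reduced score; the distant box 2 is untouched. A downstream confidence threshold then decides whether the decayed box is kept, which helps in crowded scenes where overlapping boxes may be distinct objects.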

nms_implementation.py (Python)
# TheCodeForge — Manual NMS implementation
import torch

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N,4) in (x1,y1,x2,y2) format
    # scores: (N,)
    keep = []
    order = scores.argsort(descending=True)
    while order.numel() > 0:
        i = order[0]
        keep.append(i)
        if order.numel() == 1:
            break
        # Compute IoU of remaining boxes with box i
        xx1 = torch.maximum(boxes[i,0], boxes[order[1:],0])
        yy1 = torch.maximum(boxes[i,1], boxes[order[1:],1])
        xx2 = torch.minimum(boxes[i,2], boxes[order[1:],2])
        yy2 = torch.minimum(boxes[i,3], boxes[order[1:],3])
        w = torch.clamp(xx2 - xx1, min=0)
        h = torch.clamp(yy2 - yy1, min=0)
        inter = w * h
        area_i = (boxes[i,2]-boxes[i,0]) * (boxes[i,3]-boxes[i,1])
        area_r = (boxes[order[1:],2]-boxes[order[1:],0]) * (boxes[order[1:],3]-boxes[order[1:],1])
        iou = inter / (area_i + area_r - inter + 1e-6)
        # Keep boxes with IoU <= threshold
        mask = iou <= iou_threshold
        order = order[1:][mask]
    return torch.tensor(keep)
Output
Indices of kept detections: tensor([ 3, 10, 22, 7, 15])
NMS Bottleneck Alert
NMS is O(N²) where N is the number of proposals before suppression. In production, with dense scenes, N can reach thousands per frame. If your inference pipeline hits a latency wall, profile NMS first. Reduce N by increasing conf_thres early, or use a batched NMS implementation in TensorRT.
Production Insight
The NMS IoU threshold is a sensitive hyperparameter.
Too high (e.g., 0.7) and duplicate detections slip through, adding false positives and lowering precision.
Too low (e.g., 0.3) and you lose valid detections of heavily occluded objects.
Rule: tune IoU threshold on a validation set with a metric like mAP@0.5:0.95. For crowded scenes, 0.45-0.55 works best.
Key Takeaway
NMS removes duplicate detections by suppressing boxes with high IoU overlap.
It's greedy O(N²) and often the inference bottleneck.
Always tune NMS parameters (iou_thres, conf_thres) for your specific scene density.

YOLOv8 Architecture and Key Improvements Over Earlier Versions

YOLOv8 (released by Ultralytics in 2023) introduced several architectural improvements over YOLOv5:
  • Anchor-free detection head: the head predicts box extents directly at each grid location and drops the separate objectness branch, simplifying the output.
  • Decoupled head: separate branches for classification and regression, improving convergence.
  • CSPDarknet backbone with improved cross-stage partial connections (C2f modules).
  • CIoU loss for localization, combined with Distribution Focal Loss (DFL) for box regression.
  • Data augmentation: Mosaic, copy-paste, MixUp, etc. Training uses the same pipeline as YOLOv5, but with retuned hyperparameters.

The model family includes Nano, Small, Medium, Large, and X-Large variants, targeting different latency/accuracy trade-offs. YOLOv8-X achieves ~53.7 mAP on COCO while running at 35 FPS on a V100.

yolov8_training.py (Python)
# TheCodeForge — Fine-tune YOLOv8 on custom dataset
from ultralytics import YOLO

# Load a pretrained model (YOLOv8n)
model = YOLO('yolov8n.pt')

# Train the model on custom data
results = model.train(
    data='dataset.yaml',  # path to dataset config YAML (train/val paths and class names)
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.001,  # Adam generally needs a lower initial LR than the 0.01 used with SGD
    optimizer='Adam',
    augment=True,
    patience=10,  # early stopping
    save=True,
    project='my_yolo_project',
    name='exp1'
)

# Validate on test set
metrics = model.val(split='test')
print(f"mAP50: {metrics.box.map50}, mAP50-95: {metrics.box.map}")
Output
mAP50: 0.765, mAP50-95: 0.542
Model Selection Guideline
Choose model size based on your hardware. Nano (n) fits on CPU and edge devices, with about 3.2M parameters and a weights file of roughly 6 MB. Small (s) is a good balance. For maximum accuracy with a high-end GPU, use X-Large (x). Always benchmark with your own images; mAP on COCO is not indicative of your domain.
Production Insight
Fine-tuning YOLOv8 on a small dataset (e.g., <500 images) often overfits.
Rule: use transfer learning and freeze the backbone for the first 10 epochs. Set mosaic augmentation to 0.0 after 10 epochs to avoid artificial textures.
Another gotcha: the default learning rate is tuned for COCO scale. Reduce it to 0.1× the default for small datasets.
Key Takeaway
YOLOv8 brings anchor-free head, decoupled detection, and improved augmentation.
Model size (n/s/m/l/x) trades speed for accuracy.
Fine-tuning requires adjusting training hyperparameters for dataset size.

YOLO Genealogy: Architecture vs Speed vs Accuracy

YOLO has evolved rapidly since its inception. Understanding the genealogy helps you choose the right version for your deployment constraints. Below is a comparison of major YOLO versions, focusing on architecture, speed, and accuracy.

Version | Year | Backbone | Detection Head | Loss | COCO mAP@0.5:0.95 | Latency T4 FP16 (ms)
YOLOv5 | 2020 | CSPDarknet | Anchor-based, coupled | CIoU + BCE | 50.2 (v5x) | 2.1 (v5n)
YOLOv6 | 2022 | EfficientRep | Anchor-based, decoupled | Varifocal Loss | 52.5 (v6L) | 1.5 (v6n)
YOLOv7 | 2022 | ELAN | Anchor-based | CIoU + BCE | 56.8 (v7x) | 2.8 (v7x)
YOLOv8 | 2023 | CSPDarknet (C2f) | Anchor-free, decoupled | CIoU + DFL | 53.7 (v8x) | 1.8 (v8n)
YOLOv9 | 2024 | CSPDarknet + PGI | Anchor-free with GELAN | CIoU + DFL + activation | 55.1 (v9x) | 2.4 (v9n)
YOLOv10 | 2024 | CSPDarknet + NMS-free | NMS-free dual assignment | CIoU + DFL | 54.2 (v10x) | 0.8 (v10n)
YOLOv11 | 2025 | CSPDarknet + task-specific | Anchor-free + DFL | CIoU + DFL + distillation | 56.0 (v11x) | 1.2 (v11n)

Note: Latency and mAP numbers are from official repositories on NVIDIA T4 GPU with FP16; actual values vary with batch size and environment.

Key Trends: Newer versions generally improve the accuracy-latency trade-off, though not monotonically (in the table, YOLOv7x still has the highest mAP), and the gains for large models (x variants) are smaller. For edge deployment, YOLOv5n and YOLOv8n remain competitive due to their small footprint. YOLOv10 introduces NMS-free inference, which removes the NMS post-processing step and the latency that comes with it.

Choosing a YOLO Version
For production, start with YOLOv8n/s — it is well-documented, stable, and supported by Ultralytics. If you need the latest accuracy and have the compute budget, try YOLOv11. For ultra-low latency on edge, consider YOLOv10n (NMS-free).
Production Insight
The table shows a clear trade-off: larger models (x variants) offer higher mAP but 3-5x slower inference. For real-time video at 30 FPS on a T4, you need under 33 ms per frame — all nano models satisfy this, but only some medium models do. Always benchmark on your target hardware because latency scales with input resolution and batch size. Also note that newer versions may require more recent software stacks (CUDA, cuDNN) which can complicate deployment on legacy systems.
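The frame-budget arithmetic is worth making explicit, because model latency is only part of the per-frame cost. The helpers below are a hypothetical sketch; overhead_ms stands for whatever pre-processing and post-processing time you measure on your own pipeline.

```python
def fps_from_latency(latency_ms):
    """Frames per second achievable at a given per-frame latency."""
    return 1000.0 / latency_ms

def fits_budget(latency_ms, target_fps, overhead_ms=0.0):
    """True if model latency plus pre/post-processing overhead fits the frame budget."""
    budget_ms = 1000.0 / target_fps
    return latency_ms + overhead_ms <= budget_ms

print(round(1000.0 / 30, 1))                   # 33.3 ms per frame at 30 FPS
print(fits_budget(1.8, 30))                    # True
print(fits_budget(25.0, 30, overhead_ms=10))   # False: 35 ms exceeds the 33.3 ms budget
```

A model that looks comfortable in isolation can miss the budget once decoding, resizing, and NMS are added, so measure the end-to-end time.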
Key Takeaway
YOLO genealogy shows steady improvement in mAP and speed, but choose the version based on your hardware and latency budget. YOLOv8n is a safe starting point; YOLOv10n is best for ultra-low latency.

Production Gotchas and Deployment Best Practices

Deploying YOLO in production goes beyond training a model on COCO. Here are the most common pitfalls:
  • Input resolution mismatch: Training at 640×640 but inference at 416×416 changes the anchor box scales and degrades accuracy. Always match resolutions or recalibrate anchors.
  • Batch normalization in inference: If you export to ONNX/TensorRT, ensure batch normalization layers are fused. Misconfiguration leads to irreversible accuracy drop.
  • Preprocessing pipeline: Many production systems resize images by letterboxing (adding black bars to maintain aspect ratio). Forgetting to exclude black pixels from detection can cause false positives on borders.
  • NMS as bottleneck: As mentioned, NMS is O(N²). Use batched NMS or TensorRT's NMS plugin for real-time systems.
  • Model quantization: Converting to FP16 or INT8 can drop accuracy, especially for small objects. Test thoroughly before deploying.

export_to_tensorrt.py (Python)
# TheCodeForge — Export YOLOv8 to TensorRT for inference
from ultralytics import YOLO

# Load trained model
model = YOLO('best.pt')

# Export to TensorRT FP16
model.export(format='engine', half=True, imgsz=640, device=0)

# Load the TensorRT model
engine_model = YOLO('best.engine')

# Run inference (faster)
results = engine_model('test.jpg', device=0)
Output
Exported to best.engine. Inference speed: 2.3 ms per frame (435 FPS on T4).
Black Bars in Letterboxing
When using letterbox resize, the padding bars are not part of the original image (black in many pipelines; Ultralytics pads with gray, value 114, by default). The model might detect objects in the padded region, especially if it was not trained on padded inputs or if the padding color differs from the one used in training. Solution: either crop the image to the exact aspect ratio before inference, or mask out the letterbox region in post-processing.
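The post-processing option can be sketched as follows, assuming centered padding and boxes in `(x1, y1, x2, y2)` pixel coordinates of the letterboxed input. The helper names `letterbox_params` and `drop_padding_boxes` are hypothetical; adapt the padding convention to whatever your pipeline actually does.

```python
def letterbox_params(orig_w, orig_h, size=640):
    """Scale and padding used to fit an image into a size x size canvas."""
    scale = min(size / orig_w, size / orig_h)
    new_w, new_h = round(orig_w * scale), round(orig_h * scale)
    pad_x = (size - new_w) / 2    # padding on the left and right
    pad_y = (size - new_h) / 2    # padding on the top and bottom
    return scale, pad_x, pad_y, new_w, new_h

def drop_padding_boxes(boxes, orig_w, orig_h, size=640):
    """Discard boxes centred in the padding; map the rest to original pixels."""
    scale, pad_x, pad_y, new_w, new_h = letterbox_params(orig_w, orig_h, size)
    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        inside = pad_x <= cx <= pad_x + new_w and pad_y <= cy <= pad_y + new_h
        if not inside:
            continue              # centre lies in the padded border
        # Undo padding and scaling, then clip to the original image bounds
        kept.append((
            min(max((x1 - pad_x) / scale, 0.0), orig_w),
            min(max((y1 - pad_y) / scale, 0.0), orig_h),
            min(max((x2 - pad_x) / scale, 0.0), orig_w),
            min(max((y2 - pad_y) / scale, 0.0), orig_h),
        ))
    return kept

# A 1280x720 frame letterboxed to 640x640 has 140 px of padding top and bottom
detections = [(100, 20, 200, 100), (100, 200, 200, 300)]
print(drop_padding_boxes(detections, 1280, 720))
# [(200.0, 120.0, 400.0, 320.0)]
```

The first detection is centered in the top padding bar and is discarded; the second is mapped back to original image coordinates.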
Production Insight
One recurring production issue is model drift over time — the distribution of objects in production changes.
Rule: set up a model monitoring pipeline that tracks mAP on a sliding window of labeled production images. If mAP drops by more than 5% over a week, retrain.
Another common failure: using the same confidence threshold for all deployment scenarios. Use a lower threshold (e.g., 0.25) for general detection and a higher one (e.g., 0.7) for safety-critical decisions.
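The monitoring rule can be sketched with a fixed-size window. This example assumes you already compute one mAP value per day from labeled production samples, and it interprets the 5% rule as an absolute drop of 0.05 against a fixed baseline; both are assumptions, and `DriftMonitor` is a hypothetical name.

```python
from collections import deque

class DriftMonitor:
    """Tracks a sliding window of daily mAP values against a fixed baseline."""

    def __init__(self, baseline_map, window_days=7, max_drop=0.05):
        self.baseline = baseline_map
        self.window = deque(maxlen=window_days)   # old days fall off automatically
        self.max_drop = max_drop

    def add(self, daily_map):
        self.window.append(daily_map)

    def should_retrain(self):
        # Wait for a full window so a single bad day does not trigger retraining
        if len(self.window) < self.window.maxlen:
            return False
        window_mean = sum(self.window) / len(self.window)
        return (self.baseline - window_mean) > self.max_drop

monitor = DriftMonitor(baseline_map=0.60)
for daily in [0.58, 0.57, 0.52, 0.50, 0.49, 0.50, 0.48]:
    monitor.add(daily)
print(monitor.should_retrain())  # True: the weekly mean is about 0.52
```

Averaging over the window trades reaction time for stability; a shorter window reacts faster but raises more false alarms.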
Key Takeaway
Align inference resolution with training resolution.
Optimize NMS and consider TensorRT FP16 for speed.
Monitor production data distribution for model drift.

Edge Deployment Benchmarks: YOLO on Embedded Devices

Deploying YOLO on edge devices (Jetson, Raspberry Pi, Coral, smartphone) requires balancing model size, accuracy, and power consumption. Below are typical benchmarks for popular edge hardware using YOLOv5 and YOLOv8 nano and small variants.

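| Device | Model | Precision | Input Size | FPS | mAP@0.5 | Power (W) |
| --- | --- | --- | --- | --- | --- | --- |
| Jetson Orin NX 16GB | YOLOv8n | FP16 | 640×640 | 180 | 44.5 | 10-25 |
| Jetson Orin NX 16GB | YOLOv8s | FP16 | 640×640 | 120 | 50.2 | 10-25 |
| Jetson Nano 4GB | YOLOv5n | FP16 | 320×320 | 40 | 33.0 | 5-10 |
| Raspberry Pi 5 | YOLOv8n | INT8 TFLite | 320×320 | 12 | 30.1 | 3-5 |
| Coral Edge TPU | YOLOv8n | INT8 TFLite | 320×320 | 30 | 28.5 | 2-4 |
| iPhone 15 Pro | YOLOv8n | CoreML FP16 | 640×640 | 45 | 44.0 | ~3 |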

Note: FPS measured with batch size 1, post-processing included. mAP on COCO val2017. Power estimates at typical load.

Key Observations
  • Jetson Orin NX delivers desktop-level performance for embedded use.
  • Raspberry Pi is suitable for low-throughput applications (e.g., sporadic object detection).
  • Coral Edge TPU offers excellent efficiency for its power budget but requires INT8 quantization, which degrades mAP.
  • Mobile devices with CoreML or GPU delegates achieve good performance for real-time mobile apps.
Quantization-Aware Training (QAT)
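When quantizing to INT8 (TFLite or TensorRT), plain post-training quantization typically costs accuracy: without QAT, expect roughly a 3-5% mAP drop. Quantization-Aware Training (QAT) recovers part of that loss by simulating INT8 rounding during fine-tuning, so the weights adapt to it. For YOLOv8, this means fine-tuning with QAT first and exporting to INT8 afterwards, which requires additional tooling such as TensorFlow QAT or PyTorch's quantization workflow.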
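To see where the accuracy loss comes from, the sketch below applies symmetric per-tensor INT8 quantization to a handful of made-up weights. It illustrates post-training rounding only, not QAT itself, and the weight values are invented for the example.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to floats; the rounding error is not recoverable."""
    return [qi * scale for qi in q]

# Hypothetical weights: one large outlier sets the scale for the whole tensor
weights = [0.002, -0.017, 0.31, -1.27, 0.0004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [0, -2, 31, -127, 0]
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

The two smallest weights collapse to zero because a single outlier stretches the quantization range. QAT exposes the network to this rounding during fine-tuning so that it can compensate.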
Production Insight
Edge deployment involves trade-offs among speed, accuracy, and power. Choose a device based on your FPS requirement and power budget. For battery-powered devices, optimize for lower input resolution (320×320) and INT8 quantization. Always test with your specific model and dataset, as COCO mAP may not reflect your domain. Also consider thermal throttling — a sustained 180 FPS on Jetson Orin may cause overheating; use a frame limiter or power mode governor.
Key Takeaway
Edge deployment benchmarks vary widely by hardware and model variant. Jetson Orin is top for performance, Raspberry Pi for low-cost prototyping, and Coral Edge TPU for power efficiency. Quantize to INT8 for better speed but expect some accuracy loss.
● Production incident · POST-MORTEM · severity: high

False Positives at Scale: When YOLO Sees Objects That Aren't There

Symptom
YOLO consistently outputs false positive bounding boxes with high confidence (>0.8) on uniform surfaces like empty roads, sidewalks, and building walls.
Assumption
The model would generalise to different lighting and surface textures because the training set included various weather conditions.
Root cause
The model was trained at 640×640 with aggressive mosaic augmentation that created artificial textures. At inference, the pipeline resized input to 416×416 without adjusting anchor box scaling — the mismatch caused the model to hallucinate objects on uniform regions.
Fix
1. Resize inference images to 640×640 (same as training).
2. Re-calibrate anchor boxes using k-means on training labels at the deployed resolution.
3. Add empty-scene images (negative examples) to the training set.
4. Fine-tune the model with a lower learning rate for 10 epochs.
Key lesson
  • Inference resolution must match training resolution exactly, or re-tune anchors.
  • Training without negative examples leaks bias: the model learns to always detect something.
  • Validate with a held-out set of typical production scenes before deployment.
Production debug guide · Symptom → Action: Diagnose inference failures fast · 4 entries
Symptom · 01
Model returns no detections even though objects are clearly present
Fix
Check the confidence threshold (conf in Ultralytics, --conf-thres in YOLOv5). The default of 0.25 is often too high for small objects. Also verify the class filter (classes): you might have set a list that excludes the needed class.
Symptom · 02
Bounding boxes are significantly misaligned with objects
Fix
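Anchor box mismatch (anchor-based models such as YOLOv5). Run k-means on your training labels to recompute anchors, and ensure the input resolution is the same as what the anchors were designed for. For anchor-free models such as YOLOv8, check instead that box coordinates are mapped back correctly from the letterboxed input to the original image.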
Symptom · 03
Inference is slower than expected (below FPS target)
Fix
Profile per-layer latency. The detection head (output layers) and NMS are common bottlenecks. Switch to a smaller model variant (nano vs large), use TensorRT FP16, or implement batch NMS.
Symptom · 04
High false positive rate on specific object class
Fix
Check class imbalance in training data. Add more negative examples for that class or adjust class weights in the loss function.
★ YOLO Quick Debug Cheat Sheet: Three commands to triage YOLO inference issues in production.
No predictions, or too many/too few predictions
Immediate action
Check model output shape and raw scores before NMS
Commands
python -c "from ultralytics import YOLO; model=YOLO('yolov8n.pt'); results=model('test.jpg'); print(results[0].boxes.conf)"
python -c "import torch; output=torch.load('output_tensor.pt'); print(output[:,4:6])" # Adjust indices for your model
Fix now
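If confidence values are all near 0 or 1, suspect preprocessing first: re-scale the input image to the size the model expects and check that pixel normalization and channel order match training. Clamping the output with torch.clamp only hides the symptom, so treat it as a temporary guard at most.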
Bounding boxes are rectangular when they should be square (or vice versa)
Immediate action
Verify anchor box aspect ratios
Commands
python -c "from ultralytics import YOLO; model=YOLO('yolov8n.pt'); print(model.model.model[-1].anchors)"
python -c "import numpy as np; anchors=np.load('anchors.npy'); print(anchors.shape, anchors)"
Fix now
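Recompute anchors using k-means on training labels (YOLOv5's AutoAnchor runs this check at the start of training). Update model.yaml with the new anchors and retrain the last layers. This applies to anchor-based versions; YOLOv8's head is anchor-free and has no anchor priors to tune.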
NMS is taking >50% of inference time
Immediate action
Check number of candidate boxes entering NMS
Commands
python -c "from ultralytics import YOLO; model=YOLO('yolov8n.pt'); results=model('test.jpg', max_det=1000); print(len(results[0].boxes))"
nvidia-smi --query-gpu=utilization.gpu --format=csv
Fix now
Set max_det to a lower value (e.g., 100), raise the confidence threshold (conf, e.g., 0.5), or run NMS inside the exported graph (TorchScript or a TensorRT NMS plugin).
YOLO Variants Comparison
| Feature | YOLOv5 (2020) | YOLOv8 (2023) | YOLOv9 (2024) |
| --- | --- | --- | --- |
| Detection Head | Anchor-based, coupled | Anchor-free, decoupled | Anchor-free with GCoupling |
| Loss Function | CIoU + BCE | CIoU + Distribution Focal Loss | CIoU + DFL + Activation Loss |
| Backbone | CSPDarknet | C2f (enhanced CSP) | CSPDarknet with PGI (Programmable Gradient Information) |
| Data Augmentation | Mosaic, MixUp | Mosaic, MixUp, HSV + Copy-Paste | Similar to v8 |
| Performance (mAP50-95) | COCO: ~50.2 (v5x) | COCO: ~53.7 (v8x) | COCO: ~55.1 (v9x) |
| Inference Speed (T4 FP16) | ~2.1 ms (v5n) | ~1.8 ms (v8n) | ~2.4 ms (v9n) |

Key takeaways

1. YOLO treats detection as a single regression problem, achieving real-time speed by eliminating region proposals.
2. Grid cells, anchor boxes, and a multi-part loss function are the core building blocks of the architecture.
3. NMS is a critical post-processing step but often becomes the inference bottleneck; optimize it for production.
4. YOLOv8 introduced an anchor-free head, decoupled detection, and improved training techniques for better accuracy.
5. Production deployment demands matching inference resolution, tuning NMS hyperparameters, and monitoring data drift.

Common mistakes to avoid

5 patterns

Using default COCO anchor boxes without recomputing for custom data

Symptom
Model struggles with elongated objects (e.g., forklifts, airplanes) and shows 3-5% lower mAP than expected.
Fix
Run k-means on your training labels to compute new anchors. YOLOv5 runs its AutoAnchor check by default at the start of training (disable it with --noautoanchor), or you can compute anchors offline. YOLOv8 is anchor-free, so this step does not apply to it.
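An offline computation might look like the sketch below: k-means over label `(width, height)` pairs with 1 - IoU as the distance, in the style introduced with YOLOv2. It is a simplified illustration (deterministic initialization, mean centroids, no genetic refinement as in YOLOv5's AutoAnchor), and the sample boxes are made up.

```python
def wh_iou(a, b):
    """IoU of two boxes given only (width, height), both anchored at the origin."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(box_whs, k, iters=50):
    """Cluster (w, h) pairs using 1 - IoU as distance; returns k anchors by area."""
    boxes = sorted(box_whs, key=lambda wh: wh[0] * wh[1])
    # Deterministic init: pick k boxes spread evenly across the area-sorted list
    anchors = [boxes[(2 * i + 1) * len(boxes) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance
            best = max(range(k), key=lambda j: wh_iou(wh, anchors[j]))
            clusters[best].append(wh)
        new_anchors = []
        for j, members in enumerate(clusters):
            if not members:                  # keep the old anchor if a cluster empties
                new_anchors.append(anchors[j])
                continue
            new_anchors.append((sum(w for w, _ in members) / len(members),
                                sum(h for _, h in members) / len(members)))
        if new_anchors == anchors:           # converged
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda wh: wh[0] * wh[1])

# Hypothetical label sizes in pixels: small, tall, and wide objects
labels = [(10, 12), (11, 13), (12, 11), (50, 100), (52, 98), (48, 102),
          (200, 60), (210, 62), (190, 58)]
print(kmeans_anchors(labels, k=3))
# [(11.0, 12.0), (50.0, 100.0), (200.0, 60.0)]
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when object sizes span an order of magnitude.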

Training with mosaic augmentation enabled for entire training run

Symptom
Model performs well on validation but fails in production where objects are not artificially composed. Many false positives and poor localization.
Fix
Disable mosaic for the last 10-20 epochs by setting mosaic=0.0 for a final training stage or by using an augmentation schedule that turns it off. YOLOv8's close_mosaic parameter automates this.

Deploying with mismatched inference resolution

Symptom
Bounding boxes consistently miss objects by shift or scale, especially on edges.
Fix
Always resize input to exactly what the model was trained on. If you must change resolution, re-calibrate anchors and fine-tune. Use letterbox resizing (scale, then pad) to maintain aspect ratio instead of stretching the image.

Not tuning NMS IoU threshold for scene density

Symptom
Crowded scenes (e.g., bus stops) have many duplicate detections; sparse scenes (e.g., empty road) miss objects.
Fix
Validate on representative production data and tune IoU threshold (0.3-0.7). Use a grid search across 0.05 increments and pick the value that maximizes F2 score.
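The grid search can be sketched as below. The `evaluate` callback is hypothetical: it is assumed to run your validation set at a given NMS IoU threshold and return true-positive, false-positive, and false-negative counts. The counts in the usage example are invented to show the mechanics.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tune_nms_iou(evaluate, start=0.30, stop=0.70, step=0.05):
    """Try each IoU threshold; evaluate(thres) must return (tp, fp, fn)."""
    best_thres, best_score = None, -1.0
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        thres = round(start + i * step, 2)   # avoid float drift such as 0.450000001
        score = f_beta(*evaluate(thres))
        if score > best_score:
            best_thres, best_score = thres, score
    return best_thres, best_score

def fake_evaluate(thres):
    """Stand-in for a real validation run; returns made-up counts."""
    return (90, 10, 10) if thres == 0.45 else (70, 10, 30)

print(tune_nms_iou(fake_evaluate))
```

F2 favors recall, which suits applications where a missed object costs more than a duplicate box. If duplicates are the bigger problem, a beta below 1 shifts the balance toward precision.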

Forgetting to apply NMS at all during inference

Symptom
Output contains dozens of overlapping boxes per object, making it impossible to use without further processing.
Fix
Ensure the inference pipeline includes an NMS step. In YOLOv8, NMS is built into the model() call — if you export to ONNX, you may need to integrate an NMS layer manually.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the concept of anchor boxes in YOLO. Why are they needed, and ho...
Q02 · SENIOR
What is the role of Non-Maximum Suppression in YOLO, and what are its pe...
Q03 · SENIOR
How does YOLOv8's loss function differ from the original YOLO loss, and ...
Q04 · SENIOR
Describe a scenario where YOLO fails and a two-stage detector would perf...
Q01 of 04 · SENIOR

Explain the concept of anchor boxes in YOLO. Why are they needed, and how do they affect training stability?

ANSWER
Anchor boxes are predefined bounding box shapes (width and height) that serve as priors for the model. Instead of predicting absolute box dimensions, YOLO predicts scaling factors (tw, th) relative to an anchor. This is crucial because bounding box dimensions vary widely (e.g., tall people vs wide cars), and direct prediction would cause unstable gradients early in training. Anchors are typically determined by k-means clustering on training labels. Without anchors, the model must learn the distribution of box shapes from scratch, which leads to slower convergence and worse performance, especially for extreme aspect ratios.
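The scaling described in the answer can be written out explicitly. The sketch below uses the YOLOv2/YOLOv3 formulation, where the center offsets pass through a sigmoid and the anchor dimensions are multiplied by exp(tw) and exp(th); later versions such as YOLOv5 use a modified form. The function name and the example anchor are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, cell, anchor, stride):
    """Decode raw outputs (tx, ty, tw, th) into a pixel-space box (cx, cy, w, h).

    cell   -- (col, row) index of the grid cell making the prediction
    anchor -- (pw, ph) prior width and height in pixels
    stride -- pixels per grid cell
    """
    tx, ty, tw, th = t
    cx = (sigmoid(tx) + cell[0]) * stride    # sigmoid keeps the centre inside its cell
    cy = (sigmoid(ty) + cell[1]) * stride
    w = anchor[0] * math.exp(tw)             # anchor width scaled by exp(tw)
    h = anchor[1] * math.exp(th)             # anchor height scaled by exp(th)
    return cx, cy, w, h

# All-zero outputs reproduce the anchor itself, centred in the cell
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(116, 90), stride=32))
# (112.0, 144.0, 116.0, 90.0)
```

With zero outputs the prediction equals the prior, so at initialization the network starts from plausible box shapes and only has to learn small corrections. That is the stabilizing effect the answer refers to.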
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is YOLO in simple terms?
02
How does YOLO handle multiple objects in an image?
03
What is the difference between YOLOv5 and YOLOv8?
04
Can I use YOLO for real-time video processing?
05
How do I improve YOLO's accuracy on my custom dataset?