YOLO: An Introduction
one of the most widely used algorithms for detecting different kinds of objects in images.
efficiency + accuracy → YOLO
localizes objects in an image efficiently.
performs object detection in just one forward pass through the neural network.
A single network pass in YOLO involves passing an input image through the neural network, which consists of various convolutional layers, pooling layers, and fully connected layers. (A pass is a single iteration through the network: the input data is propagated forward through the network and the output is calculated. Passes are what allow us to train the network to perform a specific task.) In contrast, traditional object detection models such as R-CNN require multiple passes to generate region proposals, classify those proposals, and refine them.
Two types of passes in neural networks:
Forward (assess performance): performed by passing the input data through each layer of the network, one by one. At each layer, the input data is multiplied by the weights of the layer and then summed. The output of each layer is then passed to the next layer as input.
Backward (improves performance): performed by propagating the error backward through the network. The gradient of the loss function with respect to each weight is computed, and the weights are updated in the direction that reduces the loss. The goal is to minimize the loss. Repeating forward and backward passes in this way is called training.
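The two passes can be sketched on the smallest possible "network": a single linear neuron trained with squared error. This is only an illustration of the idea, not how YOLO is trained; the function names, learning rate, and data are assumptions made up for the example.

```python
def forward(w, b, x):
    """Forward pass: weighted sum of the inputs plus a bias (one linear neuron)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def loss(w, b, x, y_true):
    """Squared-error loss for a single training example."""
    return 0.5 * (forward(w, b, x) - y_true) ** 2


def backward(w, b, x, y_true, lr=0.1):
    """Backward pass: compute the loss gradient, then take one gradient-descent step."""
    error = forward(w, b, x) - y_true        # d(loss)/d(prediction)
    grad_w = [error * xi for xi in x]        # chain rule: d(loss)/d(w_i) = error * x_i
    grad_b = error                           # d(loss)/d(b) = error
    new_w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b


# Hypothetical training example: input x should map to target 3.0.
w, b = [0.0, 0.0], 0.0
x, y_true = [1.0, 2.0], 3.0

history = []
for _ in range(20):
    history.append(loss(w, b, x, y_true))    # forward pass assesses performance
    w, b = backward(w, b, x, y_true)         # backward pass improves it
```

Each loop iteration records the loss from a forward pass and then updates the weights with a backward pass, so the values in `history` shrink as training proceeds.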
network directly processes the input image and predicts bounding boxes and class probabilities for all objects in a single evaluation of the network.
🎯 YOLOv8 strikes a better balance between speed and accuracy than earlier YOLO versions.
Different models for different deployment scenarios: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large).
The five models differ in size, speed, and accuracy: the smaller models are faster, while the larger ones are more accurate.
YOLOv8 features:
anchor-free approach: directly predicts the center of the object instead of an offset from a predefined anchor box, which makes it more robust to different scales and occlusions. (Anchor boxes are predefined boxes of different sizes and aspect ratios, used as reference points for bounding box predictions.)
Speeds up processing, as the offset from an anchor box no longer needs to be calculated.
Non-maximum suppression (NMS): a post-processing step that filters out multiple overlapping bounding box predictions. Multiple predictions may occur for the same object due to variations in the size, orientation, or position of the object.
Confidence score: the probability that the detected object exists within that bounding box. NMS selects the bounding box with the highest confidence score as the most accurate prediction for that object, then compares it with each remaining box. If the overlap exceeds a set IoU (Intersection over Union) threshold, the bounding box with the lower confidence score is suppressed (removed). This process is repeated until all the bounding boxes have been evaluated.
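The suppression loop can be sketched in plain Python as a greedy procedure. This is a minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corners; the helper names, the 0.5 threshold, and the sample boxes are illustrative choices, not values taken from YOLOv8.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the boxes that survive, best score first."""
    # Visit boxes from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence box still under consideration
        keep.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.75, 0.8]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap at all and is kept.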
Object Detection
Localization: finding the area where the object lies, i.e. its bounding box.
Detection: finding whether the object exists or not.
Recognition: classifying the category of the object.
Two main categories of algorithms (based on how many times the same input image is passed through a network):
1. Single-shot detectors: process an entire image using a fully convolutional neural network (CNN) in a single pass and make predictions about the presence and location of objects in the image. They are less effective and accurate in detecting small objects.
2. Two-stage detectors: use two passes of the input image. The first pass generates a set of proposals (potential object locations); the second pass refines these proposals and makes the final predictions. They are more accurate but computationally expensive.
→ Single-shot object detection is better suited for real-time applications, while two-stage object detection is better for applications where accuracy is more important.
Two performance evaluation metrics:
Intersection over Union (IoU): measures localization accuracy and quantifies localization errors. It is the ratio of the area of overlap to the area of union between the predicted bounding box and the actual (ground truth) one, so it estimates how close the prediction is to the truth.$$ \mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} $$
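A small worked example makes the ratio concrete. The sketch below assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates; the function name and the sample boxes are made up for illustration.

```python
def box_iou(pred, truth):
    """IoU of a predicted box and a ground-truth box, each as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if the boxes overlap at all).
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - overlap   # count the shared area only once
    return overlap / union if union > 0 else 0.0


pred = (50, 50, 150, 150)      # hypothetical predicted box, area 10000
truth = (100, 100, 200, 200)   # hypothetical ground-truth box, area 10000
score = box_iou(pred, truth)   # overlap 2500, union 17500
```

For these two boxes the overlap is 2500 and the union is 17500, so the IoU is 1/7 (about 0.14). A perfect prediction gives 1.0 and disjoint boxes give 0.0.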
Average Precision (AP): calculates the average of precision values at different recall levels. It takes into account both how many objects the model detects (recall) and how accurate those detections are (precision).
→ Precision measures reliability: the fraction of correctly detected objects out of all the objects the model predicted.
→ Recall measures completeness: the fraction of correctly detected objects out of all the ground truth (actual) objects.
The higher the average precision score, the better the model's object detection accuracy.
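Precision, recall, and AP can be sketched for a ranked list of detections as follows. Benchmarks differ in how they interpolate the precision-recall curve; this sketch uses the simplest uninterpolated sum, which is an assumption rather than the convention of any particular benchmark. The function names and the sample data are hypothetical.

```python
def precision_recall_curve(is_true_positive, num_ground_truth):
    """Precision and recall after each detection.

    `is_true_positive` must be ordered by descending confidence score.
    """
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)              # correct / predicted so far
        recalls.append(true_positives / num_ground_truth)     # correct / actual objects
    return precisions, recalls


def average_precision(precisions, recalls):
    """Uninterpolated AP: precision weighted by each increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical result: 4 detections ranked by confidence, 4 real objects in total.
hits = [True, False, True, True]
precisions, recalls = precision_recall_curve(hits, num_ground_truth=4)
ap = average_precision(precisions, recalls)
```

In this example the model finds three of the four objects (final recall 0.75) and makes one false detection, giving an AP of about 0.60.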
Classification Pipeline
Architecture
A simple CNN architecture for YOLOv8 could consist of the following layers:
1. Convolutional layer: extracts features from the input image.
2. Pooling layer: reduces the spatial resolution of the feature maps (improves efficiency and accuracy).
-Efficiency: reduced number of parameters and computations required.
-Accuracy: The model focuses on the most important features of the image.
→ Two main ways to reduce spatial resolution: pooling layers and strides (convolutions with a stride greater than 1)
3. Fully connected layer: classifies the feature maps into different object classes.
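How pooling and strides shrink a feature map can be sketched with the standard output-size formula and a tiny 2×2 max pooling routine. The function names and the sample feature map are assumptions made up for the example.

```python
def output_size(n, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling layer on an input of size n."""
    return (n + 2 * padding - kernel) // stride + 1


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 6, 5],
]
pooled = max_pool_2x2(fmap)                                     # 4x4 -> 2x2
pooled_side = output_size(4, kernel=2, stride=2)                # pooling halves each side
strided_side = output_size(448, kernel=3, stride=2, padding=1)  # stride-2 convolution
```

Pooling keeps only the strongest response in each 2×2 window, so the map has a quarter as many values. A stride-2 convolution halves each side in the same way, which is why either option reduces later computation.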
In the original YOLO design, the first 20 convolutional layers of the model are pre-trained on ImageNet, with a temporary average pooling layer and fully connected layer plugged in for this step.
Additional convolutional and fully connected layers are then added to the pre-trained network, which improves performance.
The final fully connected layer predicts both class probabilities and bounding box coordinates.
YOLO divides the input image into a grid of cells. Each cell is responsible for predicting the presence of an object and its bounding box. The CNN extracts features for each cell, and in the original YOLO design these features are passed to the fully connected layer, which predicts the presence of an object and its bounding box.
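The idea of a cell being "responsible" for an object can be sketched by mapping an object's center point to a grid cell. The 7×7 grid and the 448-pixel image below are illustrative assumptions, and the function name is hypothetical.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size=7):
    """Return (row, col) of the grid cell that contains the object center (cx, cy)."""
    # Scale the center into grid units, then truncate to a cell index.
    col = int(cx / img_w * grid_size)
    row = int(cy / img_h * grid_size)
    # A center lying exactly on the right or bottom edge belongs to the last cell.
    return min(row, grid_size - 1), min(col, grid_size - 1)


# Hypothetical object centered at (300, 100) in a 448x448 image.
cell = responsible_cell(cx=300, cy=100, img_w=448, img_h=448)
```

With a 7×7 grid each cell spans 64 pixels, so a center at (300, 100) falls in row 1, column 4, and that cell is the one expected to predict the object.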