# trainYOLOv2ObjectDetector

Train YOLO v2 object detector

## Description

### Train a Detector

detector = trainYOLOv2ObjectDetector(trainingData,lgraph,options) returns an object detector trained using the you only look once version 2 (YOLO v2) network architecture specified by the input lgraph. The options input specifies training parameters for the detection network.

[detector,info] = trainYOLOv2ObjectDetector(___) also returns information on the training progress, such as the training loss and learning rate for each iteration.

### Resume Training a Detector

detector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options) resumes training from the saved detector checkpoint.

You can use this syntax to:

• Add more training data and continue the training.

• Improve training accuracy by increasing the maximum number of iterations.

### Fine Tune a Detector

detector = trainYOLOv2ObjectDetector(trainingData,detector,options) continues training a YOLO v2 object detector. Use this syntax for fine-tuning a detector.

### Multiscale Training

detector = trainYOLOv2ObjectDetector(___,'TrainingImageSize',trainingSizes) specifies the image sizes for multiscale training by using a name-value pair in addition to the input arguments in any of the preceding syntaxes.

## Examples

Load the training data for vehicle detection into the workspace.

data = load('vehicleTrainingData.mat');
trainingData = data.vehicleTrainingData;

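Specify the directory in which the training samples are stored. Add the full path to the file names in the training data.

dataDir = fullfile(toolboxdir('vision'),'visiondata');
trainingData.imageFilename = fullfile(dataDir,trainingData.imageFilename);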

Randomly shuffle data for training.

rng(0);
shuffledIdx = randperm(height(trainingData));
trainingData = trainingData(shuffledIdx,:);

Create an imageDatastore using the files from the table.

imds = imageDatastore(trainingData.imageFilename);

Create a boxLabelDatastore using the label columns from the table.

blds = boxLabelDatastore(trainingData(:,2:end));

Combine the datastores.

ds = combine(imds, blds);

Load a preinitialized YOLO v2 object detection network.

net = load('yolov2VehicleDetector.mat');
lgraph = net.lgraph
lgraph =
LayerGraph with properties:

Layers: [25×1 nnet.cnn.layer.Layer]
Connections: [24×2 table]
InputNames: {'input'}
OutputNames: {'yolov2OutputLayer'}

Inspect the layers in the YOLO v2 network and their properties. You can also create the YOLO v2 network by following the steps given in Create YOLO v2 Object Detection Network.

lgraph.Layers
ans =
25x1 Layer array with layers:

1   'input'               Image Input                128x128x3 images
2   'conv_1'              Convolution                16 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
3   'BN1'                 Batch Normalization        Batch normalization
4   'relu_1'              ReLU                       ReLU
5   'maxpool1'            Max Pooling                2x2 max pooling with stride [2  2] and padding [0  0  0  0]
6   'conv_2'              Convolution                32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
7   'BN2'                 Batch Normalization        Batch normalization
8   'relu_2'              ReLU                       ReLU
9   'maxpool2'            Max Pooling                2x2 max pooling with stride [2  2] and padding [0  0  0  0]
10   'conv_3'              Convolution                64 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
11   'BN3'                 Batch Normalization        Batch normalization
12   'relu_3'              ReLU                       ReLU
13   'maxpool3'            Max Pooling                2x2 max pooling with stride [2  2] and padding [0  0  0  0]
14   'conv_4'              Convolution                128 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
15   'BN4'                 Batch Normalization        Batch normalization
16   'relu_4'              ReLU                       ReLU
17   'yolov2Conv1'         Convolution                128 3x3 convolutions with stride [1  1] and padding 'same'
18   'yolov2Batch1'        Batch Normalization        Batch normalization
19   'yolov2Relu1'         ReLU                       ReLU
20   'yolov2Conv2'         Convolution                128 3x3 convolutions with stride [1  1] and padding 'same'
21   'yolov2Batch2'        Batch Normalization        Batch normalization
22   'yolov2Relu2'         ReLU                       ReLU
23   'yolov2ClassConv'     Convolution                24 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
24   'yolov2Transform'     YOLO v2 Transform Layer.   YOLO v2 Transform Layer with 4 anchors.
25   'yolov2OutputLayer'   YOLO v2 Output             YOLO v2 Output with 4 anchors.

Configure the network training options.

options = trainingOptions('sgdm',...
'InitialLearnRate',0.001,...
'Verbose',true,...
'MiniBatchSize',16,...
'MaxEpochs',30,...
'Shuffle','never',...
'VerboseFrequency',30,...
'CheckpointPath',tempdir);

Train the YOLO v2 network.

[detector,info] = trainYOLOv2ObjectDetector(ds,lgraph,options);
*************************************************************************
Training a YOLO v2 Object Detector for the following object classes:

* vehicle

Training on single CPU.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:01 |         7.13 |         50.8 |          0.0010 |
|       2 |          30 |       00:00:14 |         1.35 |          1.8 |          0.0010 |
|       4 |          60 |       00:00:27 |         1.13 |          1.3 |          0.0010 |
|       5 |          90 |       00:00:39 |         0.64 |          0.4 |          0.0010 |
|       7 |         120 |       00:00:51 |         0.65 |          0.4 |          0.0010 |
|       9 |         150 |       00:01:04 |         0.72 |          0.5 |          0.0010 |
|      10 |         180 |       00:01:16 |         0.52 |          0.3 |          0.0010 |
|      12 |         210 |       00:01:28 |         0.45 |          0.2 |          0.0010 |
|      14 |         240 |       00:01:41 |         0.61 |          0.4 |          0.0010 |
|      15 |         270 |       00:01:52 |         0.43 |          0.2 |          0.0010 |
|      17 |         300 |       00:02:05 |         0.42 |          0.2 |          0.0010 |
|      19 |         330 |       00:02:17 |         0.52 |          0.3 |          0.0010 |
|      20 |         360 |       00:02:29 |         0.43 |          0.2 |          0.0010 |
|      22 |         390 |       00:02:42 |         0.43 |          0.2 |          0.0010 |
|      24 |         420 |       00:02:54 |         0.59 |          0.4 |          0.0010 |
|      25 |         450 |       00:03:06 |         0.61 |          0.4 |          0.0010 |
|      27 |         480 |       00:03:18 |         0.65 |          0.4 |          0.0010 |
|      29 |         510 |       00:03:31 |         0.48 |          0.2 |          0.0010 |
|      30 |         540 |       00:03:42 |         0.34 |          0.1 |          0.0010 |
|========================================================================================|
Detector training complete.
*************************************************************************

Inspect the properties of the detector.

detector
detector =
yolov2ObjectDetector with properties:

ModelName: 'vehicle'
Network: [1×1 DAGNetwork]
TrainingImageSize: [128 128]
AnchorBoxes: [4×2 double]
ClassNames: vehicle

You can verify the training accuracy by inspecting the training loss for each iteration.

figure
plot(info.TrainingLoss)
grid on
xlabel('Number of Iterations')
ylabel('Training Loss for Each Iteration')

Read a test image into the workspace.

img = imread('detectcars.png');

Run the trained YOLO v2 object detector on the test image for vehicle detection.

[bboxes,scores] = detect(detector,img);

Display the detection results.

if(~isempty(bboxes))
img = insertObjectAnnotation(img,'rectangle',bboxes,scores);
end
figure
imshow(img)

## Input Arguments

trainingData: Labeled ground truth images, specified as a datastore or a table.

• If you use a datastore, your data must be set up so that calling the datastore with the read and readall functions returns a cell array or table with two or three columns. When the output contains two columns, the first column must contain bounding boxes, and the second column must contain labels, {boxes,labels}. When the output contains three columns, the second column must contain the bounding boxes, and the third column must contain the labels. In this case, the first column can contain any type of data. For example, the first column can contain images or point cloud data.

| data | boxes | labels |
| --- | --- | --- |
| The first column can contain data, such as point cloud data or images. | The second column must be a cell array that contains M-by-5 matrices of bounding boxes of the form [xcenter, ycenter, width, height, yaw]. The vectors represent the location and size of bounding boxes for the objects in each image. | The third column must be a cell array that contains M-by-1 categorical vectors containing object class names. All categorical data returned by the datastore must contain the same categories. |

• If you use a table, the table must have two or more columns. The first column of the table must contain image file names with paths. The images must be grayscale or truecolor (RGB) and they can be in any format supported by imread. Each of the remaining columns must be a cell vector that contains M-by-4 matrices that represent a single object class, such as vehicle, flower, or stop sign. The columns contain 4-element double arrays of M bounding boxes in the format [x,y,width,height]. The format specifies the upper-left corner location and size of the bounding box in the corresponding image. To create a ground truth table, you can use the Image Labeler app or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function.
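As a rough illustration of the table layout described above, the following sketch builds a two-image ground truth table by hand. The file names, the vehicle class column, and the box values are hypothetical placeholders and do not come from any shipped data set.

```matlab
% Hypothetical image files and boxes, used only to illustrate the layout.
imageFilename = {'images/car_001.png'; 'images/car_002.png'};

% One column per object class. Each cell holds an M-by-4 matrix of
% [x,y,width,height] boxes for the corresponding image.
vehicle = {[30 40 60 35]; [12 50 48 30; 90 44 52 33]};

trainingTable = table(imageFilename,vehicle);

% Check that every entry in the class column is an M-by-4 matrix.
isValid = cellfun(@(b) ismatrix(b) && size(b,2) == 4,trainingTable.vehicle);
disp(all(isValid))
```

In practice, the objectDetectorTrainingData function produces a table in this layout from labeled ground truth, so building one by hand is mainly useful for small experiments.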

Note

When the training data is specified using a table, the trainYOLOv2ObjectDetector function checks these conditions:

• The bounding box values must be integers. Otherwise, the function automatically rounds each noninteger value to its nearest integer.

• The bounding box must not be empty and must be within the image region. While training the network, the function ignores empty bounding boxes and bounding boxes that lie partially or fully outside the image region.
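The conditions in this note can be approximated on your own data before training. The sketch below assumes boxes in [x,y,width,height] format and a hypothetical 128-by-128 image. It rounds the boxes and keeps only those that are nonempty and lie fully inside the image. The inside-the-image test is one plausible reading of the rule, not the internal implementation of the function.

```matlab
imageSize = [128 128];            % [height width], assumed for illustration
boxes = [10.4 20.6 30.2 15.0;     % valid after rounding
         120  100  20   10;       % extends past the right edge
         5    5    0    0];       % empty box

% Round noninteger values to the nearest integer.
boxes = round(boxes);

x = boxes(:,1); y = boxes(:,2); w = boxes(:,3); h = boxes(:,4);
isNonEmpty = w > 0 & h > 0;
isInside = x >= 1 & y >= 1 & ...
    (x + w - 1) <= imageSize(2) & (y + h - 1) <= imageSize(1);

% Keep only the boxes that pass both checks.
validBoxes = boxes(isNonEmpty & isInside,:);
disp(validBoxes)
```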

lgraph: Layer graph, specified as a LayerGraph object. The layer graph contains the architecture of the YOLO v2 network. You can create this network by using the yolov2Layers function. Alternatively, you can create the network layers by using the yolov2TransformLayer, yolov2ReorgLayer, and yolov2OutputLayer functions. For more details on creating a custom YOLO v2 network, see Design a YOLO v2 Detection Network.

options: Training options, specified as a TrainingOptionsSGDM, TrainingOptionsRMSProp, or TrainingOptionsADAM object returned by the trainingOptions (Deep Learning Toolbox) function. To specify the solver name and other options for network training, use the trainingOptions (Deep Learning Toolbox) function.

Note

The trainYOLOv2ObjectDetector function does not support these training options:

• The trainingOptions Shuffle values, 'once' and 'every-epoch' are not supported when you use a datastore input.

• Datastore inputs are not supported when you set the DispatchInBackground training option to true.

checkpoint: Saved detector checkpoint, specified as a yolov2ObjectDetector object. To save the detector after every epoch, set the 'CheckpointPath' name-value argument when using the trainingOptions function. Saving a checkpoint after every epoch is recommended because network training can take a few hours.

To load a checkpoint for a previously trained detector, load the MAT-file from the checkpoint path. For example, if the CheckpointPath property of the object specified by options is '/checkpath', you can load a checkpoint MAT-file by using this code.

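% Replace <checkpoint file name> with the name of a saved checkpoint MAT-file.
data = load('/checkpath/<checkpoint file name>.mat');
checkpoint = data.detector;

The name of the MAT-file includes the iteration number and timestamp of when the detector checkpoint was saved. The detector is saved in the detector variable of the file. Pass the loaded checkpoint back into the trainYOLOv2ObjectDetector function: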

yoloDetector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options);
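When several checkpoints have accumulated, it can be convenient to pick the most recent one programmatically. The following is a minimal sketch under stated assumptions: the folder, the file name placeholder_checkpoint.mat, and the placeholder structure saved in it are all hypothetical stand-ins so that the code runs on its own. In a real session, checkpointDir would be the folder given as CheckpointPath and the file would contain a yolov2ObjectDetector object.

```matlab
% Stand-in checkpoint folder and file so that the sketch runs on its own.
% In practice, checkpointDir is the CheckpointPath used during training.
checkpointDir = tempname;
mkdir(checkpointDir);
detector = struct('Note','placeholder for a yolov2ObjectDetector object');
save(fullfile(checkpointDir,'placeholder_checkpoint.mat'),'detector');

% Find the most recently modified MAT-file in the checkpoint folder.
files = dir(fullfile(checkpointDir,'*.mat'));
[~,newest] = max([files.datenum]);

% Load it and pull out the variable named detector.
data = load(fullfile(files(newest).folder,files(newest).name));
checkpoint = data.detector;
```

Selecting by modification time avoids parsing the iteration number and timestamp out of the file name, at the cost of relying on the file system dates being trustworthy.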

detector: Previously trained YOLO v2 object detector, specified as a yolov2ObjectDetector object. Use this syntax to continue training a detector with additional training data or to perform more training iterations to improve detector accuracy.

trainingSizes: Set of image sizes for multiscale training, specified as an M-by-2 matrix, where each row is of the form [height width]. For each training epoch, the input training images are randomly resized to one of the M image sizes specified in this set.

If you do not specify trainingSizes, the function sets this value to the size in the image input layer of the YOLO v2 network. The network resizes all training images to this value.

Note

The input trainingSizes values specified for multiscale training must be greater than or equal to the input size in the image input layer of the lgraph input argument.
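The size constraint in this note can be checked before training starts. The sketch below assumes a 128-by-128 image input layer, as in the example network, and a hypothetical list of candidate sizes. It drops any candidate that is smaller than the input size.

```matlab
inputSize = [128 128];   % [height width] of the image input layer (assumed)

% Candidate sizes for multiscale training, one [height width] pair per row.
candidateSizes = [96 96; 128 128; 160 160; 192 192];

% Every size must be at least as large as the network input size.
isLargeEnough = all(candidateSizes >= inputSize,2);
trainingSizes = candidateSizes(isLargeEnough,:);
disp(trainingSizes)

% The filtered matrix can then be passed as the name-value argument:
% detector = trainYOLOv2ObjectDetector(ds,lgraph,options, ...
%     'TrainingImageSize',trainingSizes);
```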

## Output Arguments

detector: Trained YOLO v2 object detector, returned as a yolov2ObjectDetector object. You can train a YOLO v2 object detector to detect multiple object classes.

info: Training progress information, returned as a structure array with seven fields. Each field corresponds to a stage of training.

• TrainingLoss — Training loss at each iteration is the mean squared error (MSE) calculated as the sum of localization error, confidence loss, and classification loss. For more information about the training loss function, see Training Loss.

• TrainingRMSE — Training root mean squared error (RMSE) is the RMSE calculated from the training loss at each iteration.

• BaseLearnRate — Learning rate at each iteration.

• ValidationLoss — Validation loss at each iteration.

• ValidationRMSE — Validation RMSE at each iteration.

• FinalValidationLoss — Final validation loss at end of the training.

• FinalValidationRMSE — Final validation RMSE at end of the training.

Each field is a numeric vector with one element per training iteration. Values that have not been calculated at a specific iteration are assigned as NaN. The structure contains the ValidationLoss, ValidationRMSE, FinalValidationLoss, and FinalValidationRMSE fields only when options specifies validation data.
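The training log in the example suggests that the reported mini-batch RMSE is approximately the square root of the reported mini-batch loss. The sketch below checks that relationship on four pairs of values copied from that log. Treat it as an observation about the printed values, not as a statement about the internal computation.

```matlab
% Values copied from the training log in the example
% (iterations 1, 30, 60, and 90).
loggedLoss = [50.8 1.8 1.3 0.4];
loggedRMSE = [7.13 1.35 1.13 0.64];

% The log rounds the loss to one decimal place, so expect small differences.
estimatedRMSE = sqrt(loggedLoss);
maxDifference = max(abs(estimatedRMSE - loggedRMSE));
fprintf('Largest difference: %.3f\n',maxDifference);
```

The same comparison can be made on a real run by comparing sqrt(info.TrainingLoss) with info.TrainingRMSE.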

### Data Preprocessing

By default, the trainYOLOv2ObjectDetector function preprocesses the training images by:

• Resizing the input images to match the input size of the network.

• Normalizing the pixel values of the input images to lie in the range [0, 1].
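The two default preprocessing steps can be reproduced by hand to see what the network receives. This is a minimal sketch, assuming a 128-by-128 network input size and a synthetic uint8 image. It is not the internal code of the function.

```matlab
networkInputSize = [128 128];             % assumed network input size
I = uint8(randi([0 255],[240 320 3]));    % synthetic RGB image

% Resize to the network input size, then scale pixel values to [0, 1].
resized = imresize(I,networkInputSize);
normalized = single(resized)/255;

disp(size(normalized))
```

Note that resizing an image also requires scaling its bounding boxes by the same factors, which this sketch leaves out.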

When you specify the training data by using a table, the trainYOLOv2ObjectDetector function performs data augmentation for preprocessing. The function augments the input dataset by:

• Reflecting the training data horizontally. The probability for horizontally flipping each image in the training data is 0.5.

• Uniformly scaling (zooming) the training data by a scale factor that is randomly picked from a continuous uniform distribution in the range [1, 1.1].

• Random color jittering for brightness, hue, saturation, and contrast.

When you specify the training data by using a datastore, the trainYOLOv2ObjectDetector function does not perform data augmentation. Instead, you can augment the training data in the datastore by using the transform function and then train the network with the augmented training data. For more information on how to apply augmentation while using datastores, see Apply Augmentation to Training Data in Datastores (Deep Learning Toolbox).
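As an illustration of a datastore augmentation, the sketch below flips one sample horizontally and moves its boxes to match. The names flipBoxes and flipSample are hypothetical helpers. The sketch assumes [x,y,width,height] boxes in pixel coordinates with pixel centers at integer positions, and one {image,boxes,labels} sample per read.

```matlab
% Flip [x,y,width,height] boxes horizontally in an image of the given width.
flipBoxes = @(boxes,imageWidth) ...
    [imageWidth + 1 - boxes(:,1) - boxes(:,3), boxes(:,2:4)];

% Flip one {image,boxes,labels} sample. A fuller augmentation would apply
% the flip at random rather than to every sample.
flipSample = @(data) {flip(data{1},2), ...
    flipBoxes(data{2},size(data{1},2)), data{3}};

% Synthetic sample in the three-column layout that read returns.
sample = {zeros(100,200,3,'uint8'),[10 20 30 40],categorical({'vehicle'})};
flipped = flipSample(sample);
disp(flipped{2})   % displays 161 20 30 40

% With a combined datastore ds, the same function could be applied as:
% augmentedDs = transform(ds,flipSample);
```

The box originally spans columns 10 to 40 of a 200-pixel-wide image, so after the flip it spans columns 161 to 191 with the same width.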

### Training Loss

During training, the YOLO v2 object detection network optimizes the MSE loss between the predicted bounding boxes and the ground truth. The loss function is defined as

$\begin{array}{l}
K_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
\quad + K_1 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
\quad + K_2 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 \\
\quad + K_3 \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
\quad + K_4 \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{array}$

where:

• S is the number of grid cells along each side of the grid, so the sums run over $S^2$ grid cells in total.

• B is the number of bounding boxes in each grid cell.

• $1_{ij}^{obj}$ is 1 if the jth bounding box in grid cell i is responsible for detecting the object. Otherwise, it is set to 0. Grid cell i is responsible for detecting the object if the overlap between the ground truth and a bounding box in that grid cell is greater than or equal to 0.6.

• ${1}_{ij}^{noobj}$ is 1 if the jth bounding box in grid cell i does not contain any object. Otherwise it is set to 0.

• ${1}_{i}^{obj}$ is 1 if an object is detected in grid cell i. Otherwise it is set to 0.

• $K_1$, $K_2$, $K_3$, and $K_4$ are the weights. To adjust the weights, modify the LossFactors property of the output layer by using the yolov2OutputLayer function.

The loss function can be split into three parts:

• Localization loss

The first and second terms in the loss function comprise the localization loss. It measures the error between the predicted bounding box and the ground truth. The parameters for computing the localization loss are the position and size of the predicted bounding box and of the ground truth. The parameters are defined as follows.

• $(x_i, y_i)$ is the center of the jth bounding box relative to grid cell i.

• $(\hat{x}_i, \hat{y}_i)$ is the center of the ground truth relative to grid cell i.

• $w_i$ and $h_i$ are the width and the height of the jth bounding box in grid cell i, respectively. The size of the predicted bounding box is specified relative to the input image size.

• $\hat{w}_i$ and $\hat{h}_i$ are the width and the height of the ground truth in grid cell i, respectively.

• $K_1$ is the weight for localization loss. Increase this value to increase the weight of bounding box prediction errors.

• Confidence loss

The third and fourth terms in the loss function comprise the confidence loss. The third term measures the objectness (confidence score) error when an object is detected in the jth bounding box of grid cell i. The fourth term measures the objectness error when no object is detected in the jth bounding box of grid cell i. The parameters for computing the confidence loss are defined as follows.

• $C_i$ is the confidence score of the jth bounding box in grid cell i.

• $\hat{C}_i$ is the confidence score of the ground truth in grid cell i.

• $K_2$ is the weight for objectness error, when an object is detected in the predicted bounding box. You can adjust the value of $K_2$ to weigh confidence scores from grid cells that contain objects.

• $K_3$ is the weight for objectness error, when an object is not detected in the predicted bounding box. You can adjust the value of $K_3$ to weigh confidence scores from grid cells that do not contain objects.

The confidence loss can cause the training to diverge when the number of grid cells that do not contain objects is more than the number of grid cells that contain objects. To remedy this, increase the value for $K_2$ and decrease the value for $K_3$.

• Classification loss

The fifth term in the loss function comprises the classification loss. For example, suppose that an object is detected in the predicted bounding box contained in grid cell i. Then, the classification loss measures the squared error between the predicted and actual class conditional probabilities for each class in grid cell i. The parameters for computing the classification loss are defined as follows.

• $p_i(c)$ is the estimated conditional class probability for object class c in grid cell i.

• $\hat{p}_i(c)$ is the actual conditional class probability for object class c in grid cell i.

• $K_4$ is the weight for classification error when an object is detected in the grid cell. Increase this value to increase the weight of the classification loss.
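To make the three parts of the loss concrete, the following sketch evaluates them for a single grid cell with one responsible box, one box that contains no object, and two classes. All numeric values, including the weights in K, are made up for illustration and are not defaults taken from this page.

```matlab
K = [5 1 1 1];   % illustrative weights [K1 K2 K3 K4], assumed for this sketch

% Responsible box: predicted versus ground truth values.
xyPred = [0.5 0.5];    xyTrue = [0.6 0.4];
whPred = [0.25 0.16];  whTrue = [0.36 0.25];
cPred = 0.8;           cTrue = 1;

% A second box in the same cell that contains no object.
cNoObjPred = 0.3;      cNoObjTrue = 0;

% Class probabilities for two classes.
pPred = [0.7 0.3];     pTrue = [1 0];

% First and second terms: position and size errors.
localizationLoss = K(1)*sum((xyPred - xyTrue).^2) + ...
    K(1)*sum((sqrt(whPred) - sqrt(whTrue)).^2);

% Third and fourth terms: objectness errors with and without an object.
confidenceLoss = K(2)*(cPred - cTrue)^2 + K(3)*(cNoObjPred - cNoObjTrue)^2;

% Fifth term: class probability errors.
classificationLoss = K(4)*sum((pPred - pTrue).^2);

totalLoss = localizationLoss + confidenceLoss + classificationLoss;
fprintf('Total loss: %.2f\n',totalLoss);   % prints "Total loss: 0.51"
```

With these values the localization part contributes 0.20, the confidence part 0.13, and the classification part 0.18, which shows how a larger first weight makes box errors dominate the total.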

## Tips

• To generate the ground truth, use the Image Labeler or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function.

• To improve prediction accuracy:

• Increase the number of images you use to train the network. You can expand the training dataset through data augmentation. For information on how to apply data augmentation for preprocessing, see Preprocess Images for Deep Learning (Deep Learning Toolbox).

• Perform multiscale training by using the trainYOLOv2ObjectDetector function. To do so, specify the 'TrainingImageSize' argument of the trainYOLOv2ObjectDetector function for training the network.

• Choose anchor boxes appropriate to the dataset for training the network. You can use the estimateAnchorBoxes function to compute anchor boxes directly from the training data.
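As a small illustration of the last tip, the sketch below estimates two anchor boxes from a handful of made-up ground truth boxes. The box values and the class name vehicle are hypothetical. With real data, you would pass the boxLabelDatastore built from your own training table and choose the number of anchors to suit the data set.

```matlab
% Hypothetical ground truth boxes for a single class named vehicle.
% Each cell holds the [x,y,width,height] boxes for one image.
vehicle = {[30 40 60 35; 100 50 20 15]; ...
           [12 50 48 30; 90 44 52 33]; ...
           [5 5 25 18]};
blds = boxLabelDatastore(table(vehicle));

% Estimate anchor boxes and the mean IoU of the boxes with their anchors.
numAnchors = 2;
[anchorBoxes,meanIoU] = estimateAnchorBoxes(blds,numAnchors);
disp(anchorBoxes)
```

A higher mean IoU indicates that the anchors fit the box shapes in the data more closely, so it can help to compare the value across several choices of numAnchors.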

Introduced in R2019a