Design and Train a YOLOv2 Network in MATLAB
From the series: Perception
In this video, Neha Goel joins Connell D’Souza to show how you can design and train a deep neural network in MATLAB®. The example discussed is a You-Only-Look-Once (YOLOv2) neural network. YOLOv2 is a popular real-time object detection algorithm for autonomous systems.
Neha first discusses the architecture of a YOLOv2 network and the different layers and then demonstrates how to assemble the layers in the network and visualize them in MATLAB. She also explains the importance of anchor boxes in a YOLOv2 network.
Neha also covers training options and how they can be manipulated to achieve the best results. The trained model is tested on a test dataset to visually inspect performance before evaluating the network with numerical performance metrics like precision-recall curves.
Published: 3 Jan 2020
Hello, and welcome to another episode of the MATLAB and Simulink Robotics Arena. For today's episode, we're going to talk about part 2 in our series on deep learning for object detection. And we're going to talk about designing and training a YOLOv2 network in MATLAB.
Now, if you remember, in our previous video we spoke about pre-processing data and preparing it for the training stage. So now let's take the data that we prepared in that video and use it to first design and then train a YOLOv2 network.
So I've got Neha with me again. Neha is now a Robotics Arena veteran, I would like to call her. So let's get started. So Neha, would you like to take over and talk us through what a YOLOv2 network is?
So, yes. So let's start. For this object detection task, we will be using YOLOv2. And what is a YOLOv2 network? We have other detection networks, but YOLOv2, You Only Look Once, is basically targeted at processing images in real time. And it gives the best results when we need to work with real-time images. OK. And here we are working with a data set of exactly those real-time images.
OK. And YOLOv2 is one network. I think YOLOv3 is the next-generation network.
Yes.
But we've seen a lot of YOLOv2. I mean, you were at Robotics Arena. You've seen a lot of YOLOv2 being used there. Yeah, OK.
So in this video, we will be going through how you can actually design the network from scratch. We will be actually designing all the layers.
So putting the layers together.
Yes.
OK. And so we will be referencing the YOLO9000 paper, the YOLOv2 paper. And we will be going through designing the network according to how the real YOLOv2 network is actually designed.
And so the basic approach is we'll have an input layer and some middle layers, which will have convolutional layers, ReLU layers, and batch normalization layers. And at the end, we will have the YOLOv2 transform layer and the output layer.
OK. So are you going to walk us through how we assemble all these layers together?
Yes.
Perfect.
So let's start with the input layer. So in MATLAB, we have the imageInputLayer function, and that actually creates the input layer for us. The inputs to imageInputLayer are the image size and whether or not you want normalization. So I will talk about the image size here.
Before you actually go into the image size, I know in our previous video, we resized our data set to be 416 by 416 by 3. But I see 128 by 128 by 3 here. So what is the correlation between those two numbers?
Yes. So in this image input layer, 128 by 128 by 3 is the minimum image size we are giving. So if we have an image smaller than 128 by 128 by 3, it will not accept it. And the 416 was according to the YOLOv2 network in the paper.
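As a quick sketch, the input layer discussed here can be created like this in MATLAB (the size is the one used in this example; turning normalization off is an assumption, since normalization is optional):

```matlab
% Input layer for the network; 128x128x3 is the minimum image size
% the network will accept. Normalization is optional here.
inputSize  = [128 128 3];
inputLayer = imageInputLayer(inputSize, 'Normalization', 'none', ...
                             'Name', 'input');
```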
OK. So next, we will be going to the filter size. So in any convolutional layer, we give the filter size, and by default it has a height and width of 3 by 3. And it defines the local regions of the input that the neurons actually operate on.
Next, I'll be talking about the middle layers. So in the middle layers, we have a repeated batch of convolution2dLayer, batchNormalizationLayer, reluLayer, and maxPooling2dLayer. And this is the very basic approach that the YOLO9000 paper actually tells us to use.
And I put in just a few. In the YOLO9000 paper, there are many layers, I guess more than 50. But here I'm putting in a very basic version: we have four convolutional layers, four batch normalization layers, four ReLU layers, and three max pooling layers.
And this is an important point that you raise, because there are a lot of networks out there that can do what you want them to do. But there is still a motivation to design your own network, especially for the kind of work our audience is doing, where they're putting these deep learning models onto embedded GPUs that don't have a lot of processing power. You may want to use a smaller network at that point just to help speed up your processing.
Yes. And here you see that it's mostly a repetition of the layers. The only difference is the number of filters, the channels, which we keep doubling. This is also a very standard approach for convolutional classification models: doubling the number of filters after every pooling layer. So, OK, let's run this section, and we will see that we have the input layer and our middle layers.
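The middle-layer pattern just described might be sketched like this; the filter counts are illustrative (doubling after each pooling layer), and the exact values in the example's files may differ:

```matlab
% Repeated convolution / batch normalization / ReLU blocks, with the
% number of filters doubling after each max-pooling layer:
% four conv, four batch norm, four ReLU, and three max-pooling layers.
filterSize = [3 3];
middleLayers = [
    convolution2dLayer(filterSize, 16, 'Padding', 1)
    batchNormalizationLayer()
    reluLayer()
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(filterSize, 32, 'Padding', 1)
    batchNormalizationLayer()
    reluLayer()
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(filterSize, 64, 'Padding', 1)
    batchNormalizationLayer()
    reluLayer()
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(filterSize, 128, 'Padding', 1)
    batchNormalizationLayer()
    reluLayer()
    ];
```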
And these are neural network layer classes, right?
Yes.
OK. Cool.
Now, I'll be talking about how we'll combine the input and the middle layers. For that, we'll be using the layerGraph function, which takes the input layer and the middle layers as input and gives us a layerGraph object. And we will be using that object further on to create the YOLOv2 network.
The number of classes is the number of object classes that we are detecting here. So here we have four objects, so that is the number of classes.
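Putting those two pieces together might look like this (a sketch; inputLayer and middleLayers are assumed to be the layer arrays assembled in the previous steps, and the class count is the four objects mentioned above):

```matlab
% Combine the input layer and middle layers into a layerGraph object,
% which will be extended into a YOLOv2 network in a later step.
numClasses = 4;        % four object classes in this data set
lgraph = layerGraph([inputLayer; middleLayers]);
```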
And you can just identify this programmatically, right? OK. So let's go ahead and run this section, and we'll see that we should have a layerGraph object created here. And if you click on it, we should see all the layers that we just assembled in MATLAB. So let's jump back into it. And there you go, keep going.
OK. So the other important part, step 3, is our anchor boxes. Anchor boxes are a concept within the YOLOv2 network structure. They are a kind of predefined bounding box, sized around the objects with respect to the image size here.
And the process of calculating the anchor boxes is basically based on a clustering method. And I have included the code files here showing how I actually did that clustering and calculated these anchor boxes. So the basic overall approach is that it gives you the most likely heights and widths for the objects, which cluster around certain sizes.
OK. So what I'm getting from this discussion is that the anchor boxes are tied to the image size that we're supplying. And if you remember, in our previous video I asked what happens if we give it an image size that's larger than our specified size, which is 416 by 416 by 3.
I can now see that, because these anchor boxes are tied to that, giving it a larger size would mean that our network would not be able to identify the objects.
Yes, that's totally correct. The anchor boxes depend on your image size and your object size. If you change the image size, your object size will change, and your anchor boxes will change.
Gotcha. So you have to calculate this for every different kind of input image size that you're supplying to this?
Yes.
OK. That makes sense. All right, so let's go ahead and run this real quick so that we can register the anchor box variable. And this is the code that Neha was talking about, the code she used to actually calculate these anchor boxes. And if you go ahead and download these files from File Exchange, you should find this code in there.
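As an aside, newer MATLAB releases also provide a built-in way to do this clustering. This is an assumption about your release and data, not the exact script used in the video:

```matlab
% estimateAnchorBoxes runs k-means clustering on the ground-truth box
% sizes in a training datastore and returns likely anchor dimensions.
numAnchors = 4;        % illustrative choice; tune for your data
[anchorBoxes, meanIoU] = estimateAnchorBoxes(trainingData, numAnchors);
```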
Awesome. So next, we're going to talk about actually assembling the YOLOv2 network. So I think up till now, we've assembled the input and the middle layers, but now we're actually going to attach some YOLO-specific layers to this, right?
Yes.
OK.
So actually, MATLAB has made it easier for us with the yolov2Layers function. This function adds the YOLOv2 architecture at the end of our layers, so you can use that.
The layers it adds are the YOLOv2 transform layer and the YOLOv2 output layer. You can add them individually, or you can add them using this function. So what are we giving this function? Whatever we have assembled before: the image size, the number of classes we have, the anchor boxes that we calculated, and the layer graph that we created.
The last thing is the feature extraction layer. So I'm using a ReLU layer here as the feature extraction layer. You can use any layer in the network you want, except for a fully connected layer.
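A sketch of that call (the variable names follow the earlier steps of this example, and the feature layer name here is hypothetical; it should match a ReLU layer in your own graph):

```matlab
% Append the YOLOv2 transform and output layers to the layer graph,
% using a ReLU layer as the feature extraction layer.
imageSize    = [128 128 3];
featureLayer = 'relu_3';   % hypothetical name of the chosen ReLU layer
lgraph = yolov2Layers(imageSize, numClasses, anchorBoxes, ...
                      lgraph, featureLayer);
analyzeNetwork(lgraph)     % visualize the assembled network
```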
So what are some other kind of feature extraction layers that you could use?
Any layer among the layer graph that you have. So you use--
So it could use a convolution2dLayer?
Yes, anything, just not a fully connected layer. We are not using a fully connected layer here, but some people do use one. You should just not use it as the feature extraction layer.
Gotcha.
So let's run this section. And we will also see the use of the analyzeNetwork function here. So analyzeNetwork gives us this figure. Actually, it's a pretty cool thing, because you can visualize the whole network that you have.
You can see all the layers, their types, their activations and weights, and how the whole structure fits together. And here you can see that these are the final YOLOv2 layers that were added automatically by the yolov2Layers function.
Yeah. So that's actually pretty cool, because we have a one-line function that's done all that for us. And what we can also see in this graph over here is our recurring sequence of convolution, ReLU, max pooling, et cetera, right?
Yeah.
OK, cool. So this is another cool feature that you can use to analyze your network. If you stay tuned for our next video, we're going to talk about an app that you can use to build and edit networks. So stay tuned for that-- plug for the next video. Awesome.
So let's actually get back and see how we can train this network. So we've gotten to the stage now where we have an entire network built. We need to go in and train this network. We want to tell it, OK, we want to identify buoys and navigation gates. So let's keep going.
So as you say, the first thing is that we have set the training flag to false, because training the network took many hours.
Yeah, so we don't want to keep you waiting for seven hours. So let's just see-- let's see what our training code involves, and then we'll actually load the detector that we ran on our GPU here.
So while training a network, we have to specify a few parameters, a few options, for how you actually want the network to learn. Here it's a pretty basic, default approach. I am using the stochastic gradient descent solver with an initial learning rate of 0.001, a mini-batch size of 16, and a maximum of 80 epochs.
The significance of these values depends mostly upon the size of your data. The initial learning rate controls how fast your algorithm learns. I set it very low; you can make it higher or lower, depending upon the time you want to spend on training.
And the maximum epochs I set to 80, meaning the network will pass over the data 80 times. You can make it higher, but it all depends upon your network size, your image size, and your data size.
And the other important part is the mini-batch size of 16. This is how many images the network takes in one iteration. I set it to 16, but you can lower it as well.
The lower you set these values, the more hours training will take. But here, I kind of played with that. And this is where your validation data set plays a role, because you tune your parameters using the validation set. You can test which parameter values suit your network best using the validation data set.
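The options described above map onto MATLAB's trainingOptions function roughly like this (the values are the ones mentioned in the video):

```matlab
% Stochastic gradient descent with momentum, with the training
% hyperparameters discussed above.
options = trainingOptions('sgdm', ...
    'InitialLearnRate',     0.001, ...
    'MiniBatchSize',        16, ...
    'MaxEpochs',            80, ...
    'ExecutionEnvironment', 'auto', ...   % CPU, GPU, or multi-GPU
    'DispatchInBackground', false);       % true for parallel prefetching
```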
OK. You said that these parameters are sort of tied to how big our data set is, but we do not have a very large data set. We had, what, maybe 3,000-odd images? Yes.
So if you're actually using this for your applications, it probably wouldn't take as long to train, because you can play around with these parameters. But you will also have larger amounts of data to deal with.
I see a couple more options in there. One is the dispatch in background option, and the other one is the execution environment. Can you talk a little bit about those?
Yeah. So these two are based on your hardware, the computer on which you are actually running your algorithm. The execution environment here I've set to auto, so it will automatically detect whether it's a CPU or a GPU machine. But some machines have multiple GPUs, or you may want to use parallel computing. These all help in running the training faster.
OK. So you can scale it up to clusters, et cetera, and stuff for that.
Yes. So if you are using the execution environment as multi-GPU, or you are using parallel computing, then the dispatch in background option comes into play. If it's true, it will use the parallel computing power; if not, then it will not use it.
OK. Cool. And this is good because, as a developer, you don't have to really do a lot of stuff to make sure that you're running it on a multi-GPU versus a single GPU. It's just another option that you add into the training options. Awesome.
And then at the end, I can see that we have the trainYOLOv2ObjectDetector function. I'm guessing that's the one that actually trains the network, because the name is very eloquent, so to say.
So to this function, we are just giving our training data, the whole network, the layer graph that we had before, and the options that we decided on. And yeah, we have also linked to the trainingOptions documentation, where you can find more options that you may want to play with for your data set.
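The train-or-load pattern discussed here might be sketched as follows (the flag and file name are assumptions, and trainingData is the data set prepared in the previous video):

```matlab
% Train the detector, or load one trained earlier to save time.
doTraining = false;
if doTraining
    [detector, info] = trainYOLOv2ObjectDetector(trainingData, ...
                                                 lgraph, options);
else
    pretrained = load('yolov2Detector.mat');   % hypothetical file name
    detector   = pretrained.detector;
end
```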
OK. So, cool, why don't we go ahead and run through this section, which is basically just going to run this last line of code, which loads in the detector. Well, it's a detector that we trained with this code, it's just that we trained it earlier. We don't want to have you sit through our training process.
But let's go ahead and run this. We should see a yolov2ObjectDetector object created here, as you can see right here. So what we have now is a trained detector that we can use to see whether we can identify our objects of interest or not, right?
Yes.
So let's go ahead and run this next section of code. What we're doing is-- so our data set is a collection of a lot of images, and we're just going to step through all those images and view them through a deployable video player. So I'm going to go ahead and run this real quick.
And we should see that our detector is able to identify the navigation gates in our data set. This is also going to give us a bounding box. So when you're trying to connect this to your control algorithm, you can say, OK, this is where in my image this particular object is.
And as you see, as our submarine sort of goes a little bit ahead through the navigation gates, you can see that it's able to identify those buoys as well. And it's able to distinguish them. I think that's a pretty remarkable thing, because remember that this data set is all underwater. So colors don't really look like colors anymore, or they don't really look the way we want them to look. So that's great, OK?
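The step-through loop just described might look roughly like this (testImages is an assumed imageDatastore of the test set):

```matlab
% Run the detector on each test image and display the annotated
% frames in a deployable video player.
player = vision.DeployableVideoPlayer;
reset(testImages)
while hasdata(testImages)
    I = read(testImages);
    [bboxes, scores, labels] = detect(detector, I);
    if ~isempty(bboxes)
        I = insertObjectAnnotation(I, 'rectangle', bboxes, ...
                                   cellstr(labels));
    end
    step(player, I);
end
```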
Now that we've finished stepping through our entire data set, let's go ahead-- so we've visually seen that our detector is doing well. It's doing what it's supposed to do; it's able to identify all the objects of interest. But there have to be some numerical ways in which we can decide how good or how bad our detector is. So do you want to talk us through what some evaluation metrics are for this detector?
Yeah. So for object detection, MATLAB has a few functions for evaluating the metrics and producing the plots. The two basic functions I'll be using here are evaluateDetectionPrecision and evaluateDetectionMissRate.
So with the precision function, we are seeing how precise our network is for each class, and we'll be plotting recall against precision. And the miss rate from the other function should be low: the lower the miss rate, the better. The output we get from evaluateDetectionMissRate is the log-average miss rate against the false positives per image.
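Those two evaluation functions are used roughly like this (results and groundTruthData are assumed to hold the detector's test-set detections and the corresponding ground truth):

```matlab
% Average precision and log-average miss rate per class, at a given
% overlap (intersection-over-union) threshold.
threshold = 0.5;
[ap, recall, precision] = ...
    evaluateDetectionPrecision(results, groundTruthData, threshold);
[logAvgMissRate, fppi, missRate] = ...
    evaluateDetectionMissRate(results, groundTruthData, threshold);

% Precision-recall curve for the first class.
figure
plot(recall{1}, precision{1})
xlabel('Recall'); ylabel('Precision')
title(sprintf('Average precision = %.2f', ap(1)))
```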
So next, if we go down and run this step, you will see the plots. I plotted, for every object that we have, the precision and the miss rate. So if you see here, the blue is the navigation gate: it gives a precision of 95% and a miss rate of just 0.05. The others, you can see, are somewhat lower, but it all depends upon the size of your data.
So we had a lot more images of the navigation gate than we had of the buoys, and that reflects in the results. Another parameter I want to talk about is the threshold value. The detection precision and miss rate are calculated using the intersection over union of our actual inference results compared to our ground truth.
So if you have a lower threshold, what's going to happen is that we are basically telling our evaluation functions that we're OK as long as there is a smaller ratio of overlap between the ground truth and our inference results. But as you see, the moment I increase this, it will change our plots significantly.
OK. So as you can see now, at a threshold value of 0.8, it's saying that we're actually missing a lot more than we should. And our precision, which was, remember, 95% for the navigation gate earlier, has now dropped to 40%.
So as you can see, your metrics are definitely very relative; these are very relative factors. You want to keep in mind what your system can and cannot handle when evaluating your detectors.
Awesome. So we're now through two parts of our deep learning for object detection video series. I highly recommend that you stick around for part 3, which talks about importing networks from other environments like TensorFlow or PyTorch, or whatever other environment you're using, to pull that network into MATLAB and then use MATLAB's training functions to train it.
All right, so stick around for part 3. And in the meantime, again, you can get in touch with us through our Facebook group or our email on the screen. And don't forget some of the other resources that we have available for you. See you around soon.