Deploy Deep Neural Networks to NVIDIA GPUs and CPUs from Simulink using GPU Coder

    Overview

    Designing deep learning and computer vision applications and deploying to embedded GPUs and CPUs like the NVIDIA Jetson and DRIVE platforms is challenging because of the resource constraints inherent in embedded devices. A Simulink-based workflow facilitates the design of these applications, and automatically generated C/C++ or CUDA® code can be deployed to achieve up to 2X faster inference than other deep learning frameworks.

    This webinar walks you through the workflow. Create an automotive lane and vehicle detection application using a YOLO v2 network in Simulink running on the CPU, test it on the same desktop using a Titan V GPU, then deploy it onto a Jetson AGX Xavier. Design and deploy deep learning networks for pedestrian detection, blood smear segmentation, and defective product detection to an Intel Xeon processor on a desktop, an ARM Cortex-A processor on a Raspberry Pi, or an NVIDIA Jetson AGX Xavier. Learn how to access peripherals from the Jetson platform for use in Simulink and with the generated code. Finally, hear about optimizations applied to the generated code that help it achieve up to 2X faster inference than other deep learning frameworks.

    Highlights

    Watch this webinar to learn how to: 

    1. Run simulations of deep learning networks in Simulink models on your desktop CPU or GPU
    2. Generate C/C++ or CUDA code from deep learning networks in Simulink models as inference engines for NVIDIA GPUs, Intel Xeon CPUs, or ARM Cortex-A processors
    3. Generate code from complete applications, including one or more deep learning networks along with pre- and postprocessing code, for NVIDIA GPUs, Intel Xeon CPUs, or ARM Cortex-A processors
    4. Automate the process to compile and download the generated code onto NVIDIA GPUs like the Jetson AGX Xavier and NVIDIA DRIVE boards
    5. Access peripherals from the Jetson platform for use in Simulink and with the generated code
    6. Apply optimizations to boost the performance of the generated C/C++ or CUDA code for deep learning networks

    About the Presenter

    Bill Chou is the Product Manager for MATLAB Coder and has been working with MathWorks code generation technologies for 15 years. Bill holds an M.S. degree in Electrical Engineering from the University of Southern California and a B.A.Sc degree in Electrical Engineering from the University of British Columbia.

    Recorded: 27 Oct 2020

    Hello, my name is Bill Chou. I'm the product manager for GPU Coder. And welcome to this webinar on deploying deep neural networks to NVIDIA GPUs and CPUs from Simulink using GPU Coder.

    Now over the past several years, we've seen many of our customers start adopting deep learning in their designs. MATLAB is a deep learning framework that enables you to design and train your deep learning networks, create a complete application around them, including the pre- and post-processing, and deploy it to a variety of different hardware.

    Now if we look at the three parts of this deep learning workflow, and take a look at the first part here, designing deep learning networks, you can create them from scratch inside of MATLAB.

    Once you've defined the deep learning network structure, you can then train it using local resources, CPUs or GPUs attached to your local machines, or access other compute resources on your local cluster or even the cloud using Amazon EC2.

    There are other ways to bring in deep learning networks into your design. If you have deep learning networks or models from other frameworks, such as Keras, TensorFlow, Caffe, or ONNX, you can import them into MATLAB.
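
    As a rough sketch of what that import step can look like in MATLAB (the model file names here are hypothetical placeholders):

        % Import networks exported from other frameworks into MATLAB.
        % 'model.onnx' and 'model.h5' are hypothetical files on the MATLAB path.
        onnxNet  = importONNXNetwork('model.onnx', 'OutputLayerType', 'classification');
        kerasNet = importKerasNetwork('model.h5');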

    Another way is to use reference models, for example, ResNet and others: bring them into MATLAB, then apply transfer learning to make them do something totally new. So from MATLAB, you're able to manage large data sets, automate the data labeling, and have easy access to pre-trained models.
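
    To make the transfer learning idea concrete, here's a minimal sketch. It assumes a hypothetical imageDatastore imds with five new classes, with images already resized to ResNet-50's 224-by-224 input:

        % Start from a pretrained ResNet-50 and retarget its final layers.
        net    = resnet50;
        lgraph = layerGraph(net);
        lgraph = replaceLayer(lgraph, 'fc1000', ...
            fullyConnectedLayer(5, 'Name', 'fcNew'));        % 5 new classes
        lgraph = replaceLayer(lgraph, 'ClassificationLayer_fc1000', ...
            classificationLayer('Name', 'outputNew'));
        opts = trainingOptions('sgdm', 'MaxEpochs', 5, ...
            'ExecutionEnvironment', 'gpu');                  % train on a local GPU
        newNet = trainNetwork(imds, lgraph, opts);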

    And we talked a little bit about training inside of MATLAB already. Moving to the second part of the workflow, in terms of designing complete applications around your deep learning networks, chances are you're going to have some pre and post-processing around it.

    And of course, you may have more than one deep learning network in your design. So you're able to do all of this from within MATLAB and Simulink.

    Finally, as you get ready to deploy your complete application, you want to take a look at the types of hardware that you're able to deploy to. Now, you might be deploying to other desktop or desktop-like systems. You might be taking them to data centers. Or perhaps you might be taking them to embedded systems. And there are a variety of different hardware out there ready to be deployed.

    Now if you look at the conventional approach to taking your algorithms down to embedded hardware, you might see something like this with the four steps. So you might start with some functional tests within Simulink. Then you look at deploying onto your desktop GPU first.

    Now at this point, you're probably doing some unit testing. You're sticking with a higher level language like Simulink, working with deep learning frameworks and probably working with large complex software stacks.

    Now in the third part where you're doing deployment integration tests, you've probably migrated from Simulink to C and C++. You might be working with some low level APIs related to deep learning networks. And you might be using some application-specific libraries.

    Once you've gone through that step, you're ready to do some real-time testing on your embedded GPU. And here again you are working with C and C++, target optimized libraries for your GPUs. And you're probably looking to optimize for memory and speed.

    Now the challenge of this kind of conventional approach is the fact that you have to deal with integrating multiple libraries and packages. At the same time, you need to verify and maintain multiple implementations in Simulink as well as in C++ and CUDA. And you also have to be mindful of algorithm and vendor lock-in as you migrate between different types of hardware.

    So what's the solution? And the solution here is to use Simulink Coder and GPU Coder for your deep learning deployment. Now you can use Simulink Coder in order to target CPUs from Intel as well as ARM. And you can use GPU Coder to deploy to NVIDIA GPUs. And you can make use of different target libraries, including cuDNN as well as TensorRT.

    Now using Simulink Coder, you can deploy things like pedestrian detection networks onto an ARM Cortex-A processor. And this makes use of the ARM Compute Library to optimize performance. If you're targeting Intel CPUs, you can make use of the MKL-DNN or the oneDNN library. And you can deploy things like a blood smear segmentation network to these processors.

    And finally, using GPU Coder, you can target NVIDIA GPUs, both desktop as well as embedded GPUs. And another application could be defect detection. Looking at all these different applications, we have to ask: what's the performance like when we deploy to all these different embedded processors?

    Now, if we take a look at some of the benchmarks that we've run, here we're showing the single-image inference performance running on an NVIDIA Titan V GPU using cuDNN. So the orange bar here is the performance you get from using GPU Coder. And this is running a standard ResNet-50 deep learning network.

    And you can see we're north of 400 images per second, compared to what you see with TensorFlow, which is just slightly above 300 images per second running on the same machine with the same CPU, GPU, and library versions. Another benchmark we'll see here is the difference in performance relative to using NVIDIA's cuDNN with single-precision data types, FP32.

    And if you switch over to using TensorRT, and especially if you use the INT8 data type, you can see a pretty nice speedup in terms of single-image inference, running again on a Titan V GPU with the ResNet-50 deep learning network.

    So far, what we've seen is that using Simulink Coder and GPU Coder, you can generate code from deep learning networks that takes advantage of several different optimization libraries: from NVIDIA, TensorRT and cuDNN; for Intel processors, the MKL-DNN or the oneDNN library; and for ARM Cortex-A processors, the ARM Compute Library.

    These deep learning libraries are great for inference. But as we mentioned earlier, chances are you're going to be building a complete application around your deep learning network that's going to do more than just inference.

    So the idea here is that you can deploy your complete application, including pre-processing and post-processing. You'll likely have more than one deep learning network inside your application. And using GPU Coder, you would then be able to generate portable target code that runs on these different types of hardware.

    Now to illustrate this idea, let's take a look at an example of a highway lane following model. So inside of Simulink, we have this example model that has several subsystems. Starting on the far right, we have subsystems for vehicle dynamics, decision logic and controller, sensor fusion, and 3D scenario and visualization.

    And of course, we have the vision detector as well. So if you run the simulation inside of Simulink, you can see on the far left the lane markers that are being detected, the vehicles with bounding boxes around them, as well as the rest of the simulation. So let's take a closer look at the vision detector subsystem here.

    So inside of that, it might look something like this. We'll use two deep learning networks. One is going to detect lanes. And the other is going to detect the vehicles that are on the road. And there'll be some pre- and post-processing just before and after them.

    Of the two deep learning networks, the lane detection network is based on AlexNet. We've actually modified it a little bit so that it can do what we ask it to do. And the vehicle detector is based on YOLO v2.

    Now in order to use deep learning networks inside of Simulink, you can open the Simulink Library Browser. And under Deep Neural Networks, you can find these two blocks, Image Classifier and Predict. So these are the ones available as of R2020b.

    Now in addition, if you're comfortable writing MATLAB code, then you can of course use the MATLAB Function block. And there you can call pretty much any of the deep learning networks that are supported inside of MATLAB.
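
    For example, a MATLAB Function block wrapping a vehicle detector might look something like this minimal sketch (the MAT file name and threshold are hypothetical; coder.loadDeepLearningNetwork is the documented way to load a network for code generation):

        function outImg = vehicleDetector(inImg)  %#codegen
        % Load the detector once and reuse it across time steps.
        persistent detector;
        if isempty(detector)
            detector = coder.loadDeepLearningNetwork('yolov2VehicleDetector.mat');
        end
        % Run inference and draw bounding boxes around detected vehicles.
        [bboxes, scores] = detect(detector, inImg, 'Threshold', 0.5);
        outImg = insertObjectAnnotation(inImg, 'rectangle', bboxes, scores);
        end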

    So once we've designed the subsystem, we'll pass in some input video. And you can see the output video, where we're detecting the left and right lanes, plus drawing bounding boxes around vehicles that we see on the road.

    So the workflow that we're going to go through, there'll be three parts. First, we'll run the simulation on our Intel CPU. From there, we'll switch over to our desktop GPU where we'll run simulation and generate CUDA code for that. And finally, we'll take our complete application, generate CUDA code, and run it on the Jetson AGX Xavier.

    So let's take a look at the first part where we'll run the simulation on our CPU. So here's our lane detector. We have two deep learning networks, as you can see over there.

    Input video is going to come in. And it'll go through some pre-processing to resize the image, which is then fed into the lane detector. And this deep learning network is defined in this MAT file that's available on our MATLAB path.

    From there, we'll do some post-processing to detect the coordinates, and then do some annotation to highlight the left and right lanes. And finally, in parallel, our input video stream is also going to the second deep learning network. And this is our vehicle detector. And it's based on YOLO v2.

    If we open it up, we'll see that inside the MATLAB Function block, it's pointing to this MAT file, which defines the YOLO v2 network. From there, the output goes through some annotations, where we'll draw bounding boxes around the vehicles.

    So if we click on Play and run it on our Intel CPU, you can see our input video on the left, and then our output video on the right. So it's running on our Intel CPU. This is a pretty large network, especially the YOLO v2. So processing will take a bit of time.

    But you can see in the output that we're identifying the left and right lanes, marked in green, and also drawing bounding boxes around vehicles that we see in the frame. The frame rate isn't that great, so in a moment, we'll take a look at running this on the GPU.

    But it does work, as you can see there. So moving on to our second part, we'll switch over to using our desktop GPU, where we'll run the simulation and generate some CUDA code. And before we do that: for our deep learning networks, as we mentioned before, we'll be using cuDNN and TensorRT, which are the optimization libraries available from NVIDIA. So that will help speed things up.

    And for the non-deep-learning portions, the pre- and post-processing, GPU Coder will also generate optimized CUDA code that will make them run faster on the desktop GPU.

    So we're back in the same model. From there, we'll go into Model Settings. And once we open that up, we can go into Simulation Target. You'll see here the language is set to C++. We'll check that GPU Acceleration box, which will let us use our GPU.

    The target libraries, we have both cuDNN and TensorRT available. So we'll stick with cuDNN as a default. From there, we can go ahead and click on the Simulation button. And here you can see the frame rate is definitely much faster than what we saw in the CPU.
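
    If you prefer to script these settings rather than use the dialog, the equivalent set_param calls look roughly like this (parameter names as they appear in MathWorks' published GPU Coder examples; the model name is a hypothetical placeholder):

        mdl = 'laneAndVehicleDetection';                % hypothetical model name
        set_param(mdl, 'TargetLang', 'C++');            % Simulation Target > Language
        set_param(mdl, 'GPUAcceleration', 'on');        % run simulation on the GPU
        set_param(mdl, 'SimDLTargetLibrary', 'cudnn');  % deep learning target library
        sim(mdl);                                       % run the simulation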

    Again, the input video on the left and the output video on the right. And you can see on the output, we're drawing the left and right lanes as we did before, and also bounding boxes around vehicles.

    So simulation looks good. Let's go ahead and start either the Embedded Coder or Simulink Coder app in order to generate code. So in this case, we'll start the Embedded Coder app. From there, we'll go into Configuration Settings, and go into Code Generation.

    Here you can see for the system target file, we have the appropriate target selected. The language will still be C++. And we check this Generate GPU Code box so that we're generating CUDA code.

    For the toolchain, we're going to select the NVIDIA toolchain. And underneath that in Interface, here is where we can select the deep learning libraries that we want to use. So again, both cuDNN and TensorRT are available. We'll stick with cuDNN for now.

    Under GPU Code, you can see the other optimized CUDA libraries: cuBLAS, cuSOLVER, and cuFFT. These are all selected, and GPU Coder will do its best to generate code that leverages those libraries.
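
    Again, the same configuration can be scripted; a hedged sketch of the code generation settings and build step, using the parameter names from MathWorks' published examples:

        mdl = 'laneAndVehicleDetection';                % hypothetical model name
        set_param(mdl, 'SystemTargetFile', 'ert.tlc');  % Embedded Coder target
        set_param(mdl, 'TargetLang', 'C++');
        set_param(mdl, 'GenerateGPUCode', 'CUDA');      % generate CUDA code
        set_param(mdl, 'DLTargetLibrary', 'cudnn');     % cuDNN for inference
        slbuild(mdl);                                   % generate and build the code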

    So once we generate code, inside the code generation report, let's search for the step function. And once it pops up, we can scroll down a little bit. Here you can see that GPU Coder has generated cudaMalloc calls, which will allocate memory on the GPU.

    We also have cudaMemcpy calls, which move the data between CPU memory and GPU memory and back. And here you can see a couple of CUDA kernels being launched. These will run on the GPU and will help to boost performance.

    Now taking a look at the generated code for the two deep learning networks, this is the one for LaneNet. And you can see all the public and private methods that are available, the standard ones: setup, predict, and cleanup.

    Here's the one for the second deep learning network, YOLO v2. Again, a pretty similar set of public and private methods. If we look inside the setup method, you'll see that this is executed right when the application starts for the first time.

    So here we're loading the deep learning network into memory. And as you go through it, you can see that we're doing this one layer at a time, starting with layer 0, layer 1. And for each layer, we're specifying the layer type. And then from there also loading in the weights and biases so that the deep learning network will be able to run inference when we call it.

    So that was the second part, where we ran the simulation on our desktop GPU and also generated CUDA code. From there, we can go to the third part, where we'll generate CUDA code and then run it on the Jetson AGX Xavier.

    So coming back to the same model, we can once again go into Configuration Settings. In Hardware Implementation, we want to make sure that we're selecting the NVIDIA Jetson as our hardware board that we'll be targeting.

    From there, under Target Hardware Resources, we can specify the board parameters. These are the device address, the username, and the password in order to log into it.

    Build action: we're going to stick with building and running the application. You can also specify the build directory. Once we have that, we can go into Hardware and click on Build, Deploy & Start.
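
    The support package also exposes the board as a MATLAB object, which is a handy way to check connectivity before deploying; a sketch where the address and credentials are placeholders for your own board settings:

        % Connect to the Jetson board over the network.
        hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');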

    Since we're deploying to the NVIDIA Jetson, it's going to take a bit of time, so we jumped ahead. But here you can see the output from the Diagnostic Viewer showing us that we've compiled the application and it's running on the Jetson Xavier. The video display here is letting us see the output from the Jetson.

    And what you saw there was that the frame rate was a little bit slower than what we saw on our desktop GPU. And that's to be expected, of course. An embedded GPU is not going to be quite as powerful as what we have on our desktops.

    So a quick recap of what we saw here with the lane and vehicle detector. We saw this running on both our desktop CPU as well as our desktop GPU. Running on the GPU definitely sped things up.

    Behind the scenes, we measured it: it was roughly seven times faster running on the GPU compared to running on our desktop CPU here.

    And in the third part, we saw that we were able to generate CUDA code and, with fairly minimal changes to the configuration of our Simulink model, switch it to run on the embedded Jetson AGX Xavier.

    So in the previous example, we talked a little bit about accessing hardware, namely the embedded Jetson or DRIVE platforms. And there are actually a couple of different ways to make use of and access this hardware.

    So one of these is to be able to access peripherals from Simulink. And using the GPU Coder Support Package for NVIDIA GPUs, you'll be able to access the peripherals on the board, for example, the webcam.

    You can, of course, also deploy standalone applications. That's essentially what we just saw. Once you generate the code and deploy it down to the Jetson or DRIVE platforms, you can disconnect from your host computer, and the application will run independently.

    A third way to access the hardware is to run the standard software-in-the-loop, processor-in-the-loop, or external mode workflows that pretty much all Simulink users out there are quite familiar with. And in terms of support for peripherals on the NVIDIA boards, here's what you can see in R2020b.

    When you install the GPU Coder Support Package for NVIDIA GPUs, you get access to these different peripheral blocks. So some of these are for communications, some for networking, and some for accessing video, like the webcam or the video display.
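
    From MATLAB, the same support package lets you grab frames from a board camera directly, which is a quick way to sanity-check a peripheral before wiring up the Simulink blocks; a sketch assuming a V4L2 webcam at /dev/video0 and the hypothetical board credentials from earlier:

        hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');
        cam   = camera(hwobj, '/dev/video0', [640 480]);  % open the webcam
        img   = snapshot(cam);                            % capture one frame
        imshow(img)                                       % view it on the host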

    So let's take a quick look at one of these ways of accessing peripherals. And that's tuning parameters and logging signals when using external mode. So in this case, we're going to use another Simulink model.

    Here is a Sobel edge detector. So there is no deep learning in this particular example. But if we take a look inside the Sobel Edge Detector subsystem, you can see the input image being fed in here. And the Sobel edge detection algorithm itself is pretty straightforward: no deep learning, just some fairly basic MATLAB code doing some convolutions.
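
    The MATLAB code inside such a block might look roughly like this minimal sketch (function and variable names are hypothetical), with the threshold exposed as an input so it can be tuned at run time:

        function edgeImg = sobelEdge(img, thresh)  %#codegen
        % Horizontal and vertical Sobel kernels.
        k  = [-1 0 1; -2 0 2; -1 0 1];
        gx = conv2(double(img), k,  'same');   % horizontal gradient
        gy = conv2(double(img), k', 'same');   % vertical gradient
        % Threshold the gradient magnitude to mark edges.
        edgeImg = sqrt(gx.^2 + gy.^2) > thresh;
        end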

    We also have one parameter, the threshold, that is being fed in. And that slider bar at the bottom will help us adjust it as we go through. Going into Configuration Parameters, we want to make sure that, again, the hardware is set to the correct one, NVIDIA Jetson. Here are all the board parameters and the build options that we saw before.

    The third tab here is External Mode. And here you can see the two communication interfaces available. So we want to select TCP/IP. And from there we can click on Monitor and Tune.

    So it does take a bit of time for us to connect to the board. So we'll jump ahead a little bit. But here you can see the output from the application running on the Jetson board. So that's the image of the bell peppers.

    And as we slide the threshold value, you can see that we have fewer edges with higher thresholds. And if we lower the threshold, obviously the algorithm will detect many more edges in the input image. So that's a quick look at tuning parameters in real time and seeing the results from the video display.

    So going back to a little bit earlier in our presentation, we talked a little bit about performance. Now the question is, how does GPU Coder achieve some of the performance benchmarks that we saw earlier? GPU Coder is a compiler. And it applies various optimizations.

    So in addition to the traditional compiler optimizations, there are also various loop optimizations and CUDA kernel lowering that we do.

    So in the interest of time, we won't be able to go through too many of these. But at a high level, there are really three areas that we spend our time on: we use optimization libraries, we look at deep learning network optimizations, and we also try to make use of coding patterns.

    So the first of these, which we talked about a couple of times a little bit earlier, is that the generated code calls optimization libraries. For the deep learning parts, depending on the hardware you're targeting, we'll use the optimization libraries that are available.

    So for example, if you're targeting an Intel CPU, then we'll make use of the Intel MKL-DNN or the oneDNN library. For NVIDIA, TensorRT as well as cuDNN. And for ARM Cortex-A processors, we make use of the ARM Compute Library.

    Now for the non-deep-learning parts, the pre-processing and post-processing, GPU Coder will try to make the generated code use these optimization libraries as much as it can. cuFFT, cuBLAS, cuSOLVER, and Thrust are some of the libraries that we support today.

    Now in the second area, we also take a look at the deep learning networks and some of the network optimizations that we can do here. So an example is, here's a snippet of a particular deep learning network. It has a couple of different layers, as you can see. And it branches out before converging back towards the bottom.

    Now if you look at some of these parts, you can see that we can actually apply some optimizations here. So for the layers that we've circled, we can actually fuse them together and apply layer fusion. And that will help to reduce the number of computations needed to process those branches.

    Now in addition to layer fusion, we can also apply buffer minimization techniques. So here you can see, at the output of these different layers, we're going to have various buffers: buffers A, B, C, D, and E.

    And some of these buffers really aren't necessary. For example, in place of buffer C, we can actually reuse buffer A. And the same thing applies to buffer E, where we can reuse buffer B. So this type of buffer minimization helps to optimize the amount of memory that is needed.

    The third area is around coding patterns. So here's an example of a coding pattern that you could use: these are called stencil kernels. With a lot of these image processing and video processing algorithms, you might have this type of coding pattern, where you're applying a kernel to a specific part of an image and sliding your way across the image.

    So GPUs are particularly good at this type of processing, or these types of patterns, if you will. And so whenever you're using this kind of pattern, you can use these coding patterns, like stencil kernels or matrix-matrix kernels, to optimize the performance of the generated code.
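
    In MATLAB code fed to GPU Coder, this pattern can be expressed explicitly with gpucoder.stencilKernel; here's a minimal sketch of a 3-by-3 mean filter (function names are hypothetical):

        function out = meanFilter3x3(img)  %#codegen
        % Map a 3x3 averaging kernel over the image; GPU Coder lowers this
        % to a CUDA stencil kernel in the generated code.
        out = gpucoder.stencilKernel(@avgWindow, img, [3 3], 'same');
        end

        function y = avgWindow(w)
        y = sum(w(:)) / 9;   % mean of the 3x3 window
        end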

    So wrapping up, as we talked about towards the beginning of the webinar, over the past couple of years, we've seen many of our customers start adopting deep learning in their designs. MATLAB and Simulink provide a deep learning framework that enables you to design and train these deep learning networks, create a complete application around them, including pre- and post-processing, and deploy the entire application to a variety of different hardware.

    So these include targeting Intel and ARM CPUs, using the MKL-DNN or oneDNN library and the ARM Compute Library. And of course, targeting NVIDIA GPUs using the cuDNN or TensorRT libraries.

    I hope this presentation was helpful for you. And thank you very much for watching.