Automatic CUDA Code Generation and Deployment on Embedded Platforms - MATLAB

    Automatic CUDA Code Generation and Deployment on Embedded Platforms

    Overview

    NVIDIA GPUs are the hardware of choice for many applications, such as autonomous systems, deep learning, and signal and image processing. MATLAB is the ideal environment for exploring, developing, and prototyping algorithms. In this seminar, we will learn how to generate CUDA code directly from MATLAB to run on NVIDIA GPUs using GPU Coder.

    Highlights

    • GPU Coder converts your MATLAB algorithms to CUDA without requiring you to be a CUDA expert
    • GPU Coder optimizes instructions and memory operations to generate efficient code

    About the Presenter

    Rishu Gupta is a senior application engineer at MathWorks India. He primarily focuses on image processing, computer vision, and deep learning applications. Rishu has over nine years of experience working on applications related to visual content. He previously worked as a scientist at LG Soft India in the research and development unit. He has published and reviewed papers in multiple peer-reviewed conferences and journals. Rishu holds a bachelor’s degree in electronics and communication engineering from BIET Jhansi, a master’s in visual contents from Dongseo University, South Korea, where he worked on applications of computer vision, and a Ph.D. in electrical engineering from Universiti Teknologi PETRONAS, Malaysia, with a focus on biomedical image processing for ultrasound images.

    Recorded: 11 Aug 2021

    We have already completed three sessions, and today is about automatic CUDA code generation and deployment on embedded platforms. This is the last session of this deep learning webinar series, and as a follow-up to this session, we also plan to have a hands-on deep learning virtual workshop, which is by invite only. Now, quickly setting up the agenda for today's discussion: we will do a quick recap of the previous sessions.

    Then, we will go into detail on today's topic, GPU Coder, which facilitates CUDA code generation for embedded targets, or NVIDIA GPUs. Then, we will go ahead and talk about algorithm design and development, as well as code generation for deployment on embedded platforms. We will also talk about how good the generated code is: what is the overall performance of the generated code as compared to other frameworks? Here we will be showing you multiple benchmarks that we have done against other frameworks, as well.

    Now, at the end we are also going to talk about how we at MathWorks can support you in your deep learning journey. Without any further delay, let me get started with the recap of the previous sessions. As part of this deep learning webinar series and the deep learning workflow, we talked about data preparation in the first session. Then, in the second and third parts of the series, we focused on AI modeling, that is, building and training the deep neural network architecture. Today's session is more around deployment. That is, once you have your algorithm ready, or once you have your trained model, how you can go ahead and do code generation and deploy the trained model onto an embedded platform.

    As part of the second session on AI modeling, we also talked about the iterative nature of deep learning, wherein we went ahead and spoke about how you can build advanced deep neural network architectures like GANs in MATLAB. Now, let me go into a little more detail on the recap itself. On the data preparation side, we mentioned that there could be many different kinds of data you could be working with while doing deep learning. It could range from images, signals, numeric data, text, and point cloud data to many other forms. MATLAB can help you label all these different data types, and for that purpose, there are several labeler apps available in MATLAB. Depending on the kind of data you are working with, it could be imageLabeler, videoLabeler, groundTruthLabeler, or lidarLabeler for point cloud data. For signal data, it could be audioLabeler or signalLabeler.
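
    As a quick reference, these labeler apps can also be launched directly from the MATLAB command line; a minimal sketch, assuming the corresponding toolboxes are installed:

    imageLabeler        % label images for classification, detection, or segmentation
    videoLabeler        % label frames of a video
    groundTruthLabeler  % label multi-sensor (camera and lidar) data
    signalLabeler       % label signal data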

    Now, in the course of the discussion we also understood that labeling is a very tedious and time-consuming job, but it is also absolutely necessary. We went ahead and took all these apps, went into detail on all these workflows, and talked about how you can extend these labelers with a custom class to facilitate on-the-fly automation. Also, we introduced the auto-bound tool, which you can leverage, along with the functions you have created as a pre-processing unit, to automate the labeling process. We also discussed multiple apps which can facilitate developing the masks, or pre-processing components. And we talked about how you can incorporate your artificial intelligence models, built with machine learning or deep learning, into these labeler apps to automate the labeling process.

    Now, in the second and third discussions we talked about modeling, where we discussed how you can build your deep neural network architecture. In the first part of this, we talked about a couple of apps, Deep Network Designer and Experiment Manager. Deep Network Designer is an app in which you can access pre-trained models, visualize and import a pre-trained model, create a model from scratch, and analyze the model. You can also train the model and iterate over it with multiple different hyperparameters. Once you are done, you can also generate equivalent MATLAB code for the entire development you have done in Deep Network Designer and then take that generated code and use it with Experiment Manager.
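
    Both apps can also be driven from the command line, and a pre-trained model can be brought in programmatically; a small sketch, assuming SqueezeNet as the starting network:

    net = squeezenet;          % load a pre-trained model
    deepNetworkDesigner(net)   % open it in Deep Network Designer for editing
    analyzeNetwork(net)        % analyze the architecture for errors and warnings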

    In Experiment Manager, you can manage your experiments by exploring the multiple hyperparameters that you may need to tune while doing deep learning, and by monitoring the training progress plot for each individual run of the neural network. You define your own experiments with MATLAB code, specifying which hyperparameters you want to tune in your neural network; these could range from the different solvers you want to try, the different data you are working with, and different neural network architectures, to hyperparameters like the learning rate or momentum in the case of the Adam or RMSProp algorithms.

    It also helps you reproduce the entire set of experiments you have done and track the results. You can also export the model settings and results to keep track of your entire development and see how the deep neural network development has evolved for your particular application and use case. As the second part of this deep learning modeling, we talked about the extended framework, which helps you develop models like GANs, if you are working on generative adversarial networks; variational autoencoders, if you are working on problems like anomaly detection, image super-resolution, image denoising, and others; comparative kinds of neural networks, like Siamese networks; or attention mechanisms, where by looking at an image you want to understand the context in the image.

    All this development is possible with the help of the extended framework that we introduced in part two, where we explored and talked about the multiple capabilities that are introduced as part of the extended framework. We introduced dlarray, which is a data container that facilitates deep learning training and other data operations in the extended framework. Then, we introduced dlnetwork, which is a network container for use with the extended framework.

    We also introduced dlfeval, which can evaluate the loss function that you have implemented for the custom neural network architecture you are building. We also introduced capabilities like automatic differentiation and backpropagation with the help of dlgradient and dlupdate. And we noted that the extended framework gives you full flexibility, at the functional level as well as at the network level, to define your own neural network architecture, do the training, and build on top of it.
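
    To make the recap concrete, here is a minimal sketch of a single custom training step using these building blocks; the layer sizes, data, and learning rate are illustrative assumptions, not code from the webinar:

    % Define a small network and wrap it in a dlnetwork container
    layers = [featureInputLayer(10)
              fullyConnectedLayer(1)];
    net = dlnetwork(layerGraph(layers));

    X = dlarray(single(rand(10, 32)), 'CB');   % dlarray data container (channel-by-batch)
    T = dlarray(single(rand(1, 32)), 'CB');    % targets

    % dlfeval evaluates the loss function with automatic differentiation enabled
    [loss, gradients] = dlfeval(@modelLoss, net, X, T);

    % Apply a simple gradient-descent update with dlupdate (illustrative learning rate)
    learnRate = 0.01;
    net = dlupdate(@(w, g) w - learnRate*g, net, gradients);

    function [loss, gradients] = modelLoss(net, X, T)
        Y = forward(net, X);
        loss = mse(Y, T);
        gradients = dlgradient(loss, net.Learnables);   % automatic differentiation
    end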

    Now, before going any further in today's discussion, I would request you to quickly go to the polling window and help me with the poll. The question that I have is: where would you put yourself while working on artificial intelligence systems? Are you actively working on a project involving AI? Are you looking forward to working on a project in two to four months? Are you acquiring knowledge in the respective areas to build competence, or none of the above? I'll wait for a few seconds for you to answer the question. Thank you to all the folks who have already answered. Please continue to answer the question in the polling window. The poll question will be up for another 30 seconds.

    In the interest of time, let me go ahead with the GPU Coder overview. Now, when we are talking about GPUs and CUDA code, let's take a step back and focus first on what a GPU is. If you look at any modern GPU from NVIDIA and look inside of it, you will see that in addition to the CUDA cores, which are doing all the parallel processing, there are also ARM processors for general computing. In order to program these GPUs, you will need to program in a proprietary language called CUDA. CUDA is actually a superset of C/C++.

    So, when you look inside these CUDA programs, you will see that much of it resembles your standard C/C++ code, but there are also constructs called CUDA kernels, which run in parallel on the CUDA cores. Once you compile the CUDA program, the CUDA kernels will run on the CUDA cores, and the C/C++ code will run on the ARM processor. If you look at the GPU, you will notice that it has its own memory. The ARM processor will, of course, have its own memory as well. So, when you're working with CUDA, you will have to take care of managing variables that exist in these two distinct memory spaces.

    Now, what are some of the challenges of programming CUDA by hand? One of the first things is that you will have to learn to program in CUDA, and beyond learning the mechanics of programming CUDA, you will need to learn conceptually how to write your algorithm to take advantage of the parallel processing paradigm that is inherent in GPUs. Part of that is learning how to create CUDA kernels that will run on the CUDA cores. You will need to analyze your algorithms to create CUDA kernels that maximize parallel processing.

    I also mentioned that there are different memory spaces for the GPU cores and the CPU, so you will need to deal with memory allocation in each memory space. And finally, as you work with CUDA programs, you will need to minimize the amount of data transfer between the GPU and CPU, because if you are working with large amounts of data, having too many data transfers will kill the performance you gain from parallel processing. So, we need to minimize the amount of data transfer that happens between the CPU and GPU.

    Now, if you are using a tool that can do automatic CUDA code generation for you, like GPU Coder, the entire process can really help you deploy your algorithms onto GPUs much faster. In terms of CUDA kernel creation, GPU Coder will run dependency analysis and loop optimizations so that it can automatically partition your MATLAB algorithm to run on the GPU cores and ARM processors. In addition, it will look at the MATLAB functions you are using, and if there are equivalent CUDA-optimized functions it can use, it will automatically map to those so you get the best possible performance. For memory allocation, it is going to figure out what the best memory allocation will be for the GPU.

    So, actually, GPUs have quite a deep memory hierarchy, and you can allocate data in local memory, global memory, constant memory, shared memory, and others. GPU Coder will automatically determine the best memory space in which to allocate the memory. GPUs also have texture memory, which we are progressively working on, and finally GPU Coder will run different sorts of analysis, like data dependency analysis, and apply techniques like dynamic memcpy reduction to minimize the amount of data that is transferred between the GPU and CPU memory spaces.

    Let's go a little deeper to understand more about CUDA kernels and parallelization. For that, let's take a look at how GPU Coder generates CUDA code from MATLAB. Let's take a simple saxpy algorithm: it computes the output of a*x + y, where a is a scalar and x and y are matrices. You can see the operation in two forms, scalarized MATLAB inside a for loop, as well as one line of vectorized MATLAB code that does the exact same thing. In both cases, the loops and matrix operations are directly compiled into CUDA kernels. GPU Coder will turn it into a short piece of CUDA code, which I'll walk you through.
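
    For reference, here is a minimal sketch of the two equivalent MATLAB forms being described; the variable names are assumptions for illustration:

    % Scalarized form: the for loop is mapped to a CUDA kernel by GPU Coder
    out = zeros(size(x));
    for i = 1:numel(x)
        out(i) = a * x(i) + y(i);
    end

    % Vectorized form: one line that does the same thing
    out = a * x + y;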

    Towards the beginning, we have the cudaMalloc calls that allocate memory on the GPU. It then copies the data from the CPU memory space to the GPU memory space using cudaMemcpy calls. We then launch saxpy kernel one, which is a CUDA kernel, and if you look inside it, you can see the CUDA code for it below. Once the CUDA kernel completes, there is a cudaMemcpy call to copy the data back from the GPU to the CPU, and finally we free up all the variables on the GPU with the help of cudaFree.

    Now, GPU Coder will automatically analyze your MATLAB algorithm and compile it into CUDA kernels that are as efficient as possible, and as I said, for memory allocation, GPU Coder supports different kinds of memory, like local, global, shared, and constant. The generated CUDA code is optimized for memory performance as well. Let's consider this piece of code that calculates the Mandelbrot set, which is a fractal algorithm. If you use GPU Coder, you can see the generated code on the right. You will see that towards the beginning it does much of the same thing: it allocates memory on the GPU and copies the input data to the GPU using cudaMemcpy.
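
    As a rough sketch of the kind of MATLAB code being discussed (the grid construction and iteration count follow the shipping GPU Coder example, not necessarily the exact code shown in the webinar):

    function count = mandelbrot_count(maxIterations, xGrid, yGrid) %#codegen
        % Vectorized Mandelbrot iteration; GPU Coder maps the element-wise
        % operations in this loop onto CUDA kernels.
        z0 = xGrid + 1i*yGrid;
        count = ones(size(z0));
        z = z0;
        for n = 0:maxIterations
            z = z.*z + z0;
            inside = abs(z) <= 2;
            count = count + inside;
        end
        count = log(count);
    end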

    From there, we have three CUDA kernels. You can look inside the kernel calls as well. So, the entire generated code, produced with the help of GPU Coder, is fully flexible, open to you, and standalone, with no dependency on MATLAB. Now, what I want to highlight here is that, between the calls to these kernels, there are no additional data copies to move the data back and forth between the GPU and the CPU and vice versa. Kernel data location is automatically optimized, and data transfer minimization is taken care of by analyzing the data dependency between the CPU and GPU partitions to determine a minimum set of locations where data must be copied between the GPU and CPU.

    Data shared between the CPU and GPU is allocated in GPU memory using cudaMalloc, or cudaMallocManaged for unified memory, and GPU Coder determines the minimum number of CUDA device sync calls needed. Now, on top of these memory optimizations and the handling of memory transfers, the coder also applies various other optimizations, and here is a short snippet of some of the optimizations done by GPU Coder. In the interest of time, let's talk about three main types of optimizations done by the coder: first, use of the optimized libraries provided by the target or specific vendor; second, for deep learning, network-level optimizations; and third, coding patterns that leverage memory optimizations, like stencil kernels.

    Let me go ahead and talk about the optimized libraries. One set of optimizations is done by plugging the generated code into optimized libraries. For deep learning networks, we plug into either NVIDIA's TensorRT or cuDNN libraries. For Intel, it is MKL-DNN, and for ARM, it is ACL, the ARM Compute Library, depending upon the target. These are vendor-provided libraries, so they can provide the best performance possible on their hardware platforms. And for the pre-processing and post-processing around the neural network, in addition to creating the most optimized CUDA kernels, we plug into optimized libraries like cuFFT, cuBLAS, cuSolver, and Thrust, and other libraries that offer the best performance possible and are provided by the vendors for their specific hardware.
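
    In code generation terms, the library is a configuration choice; a minimal sketch, using the option strings I believe are documented for GPU Coder and MATLAB Coder:

    cfg = coder.gpuConfig('lib');
    cfg.DeepLearningConfig = coder.DeepLearningConfig('tensorrt');   % or 'cudnn'
    % For CPU targets via MATLAB Coder, the analogous choices include
    % coder.DeepLearningConfig('mkldnn') and coder.DeepLearningConfig('arm-compute').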

    Now, moving on. In the case of deep learning, the coder applies optimizations specific to neural networks. For example, we might have a network where part of it is as shown. Instead of six layers, we can fuse them together into two fused layers, called FusedConv and FusedConvBatchNormAdd. The fused layers reduce the amount of computation needed. In addition, we can optimize the amount of memory used at the outputs of the layers, where we use buffers to store results, and if you look closely, you can see that buffer C isn't needed at all; we can reuse buffer A. The same is true for buffer E; we can just reuse buffer B.

    So, buffer minimization helps optimize memory as well. Now, for the third type of optimization, we look at coding patterns, and in this case we use stencil kernels. You see this type of coding pattern frequently with image processing algorithms, where we need to apply a convolutional kernel on an input image in order to compute the output image. Here, we automatically apply this coding pattern when you use image processing functions like imfilter, imerode, imdilate, or convolution, but you can also manually apply the coding pattern by using the gpucoder.stencilKernel pragma.
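
    A small sketch of the manual form, assuming a 3-by-3 mean filter as the stencil function (the function and window size are illustrative, not from the webinar):

    function out = blurImage(in) %#codegen
        % Apply the stencil pattern explicitly; GPU Coder maps this to a
        % shared-memory CUDA kernel.
        out = gpucoder.stencilKernel(@meanFilter, in, [3 3], 'same');
    end

    function y = meanFilter(patch)
        % Average over the 3x3 neighborhood handed in by the stencil kernel
        y = sum(patch(:)) / numel(patch);
    end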

    Another type of coding pattern is the matrix-matrix kernel, which is applied automatically when you use MATLAB functions like matchFeatures, SAD, or SSD, or when evaluating distances with pdist. Again, you can also manually apply the coding pattern by using special kernels, in this case the gpucoder.matrixMatrixKernel pragma. For both stencil and matrix-matrix kernels, under the hood, the coder applies optimizations like using shared memory on the GPU, which helps boost performance. So, these three types of optimizations give you a taste of why the coder is able to generate code with great performance.

    That was about some of the optimizations that GPU Coder applies to the generated code. Now, let me go a little deeper and talk about a test bench, the embedded deployment workflow, and how you can take your algorithm from the MATLAB or Simulink environment and deploy it onto an embedded target. Since we are talking about deep learning and deep neural network architectures, there are many ways in which we could obtain a neural network architecture, and many different types of architecture. It could be an image classification network, a semantic segmentation network, or an object detector, or if you are working on signals, it could be an LSTM or BiLSTM kind of neural network architecture.

    Now, you could be working in MATLAB to build and train these kinds of AI networks, or you could leverage the interoperability features, wherein you might have networks trained in other frameworks and imported into MATLAB, either with the direct importers, like for TensorFlow and Caffe, or via ONNX, the Open Neural Network Exchange. If you are working in any other framework, like PyTorch or MXNet, or any other toolkit or trainer that supports ONNX, you can export the trained model to ONNX and then import it directly into MATLAB. And once you have the network imported into MATLAB, you can leverage it for almost anything. We have talked about that from the labeling perspective; now we are also going to talk about it from the code generation perspective.
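
    A minimal sketch of the ONNX route, assuming a hypothetical model file name:

    % Import a network exported from PyTorch, MXNet, etc. via ONNX
    net = importONNXNetwork("model.onnx", "OutputLayerType", "classification");
    analyzeNetwork(net)   % inspect the imported architecture before code generation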

    When we are talking about deep learning, we understand that you could be working on a much bigger application, because deep learning, in general, is part of a bigger application. Let me talk about a simple test bench for automated driving. If you look at an automated driving test bench, there are a lot of pieces to it. There is scene and scenario visualization, where you need to create a scene or a scenario. Then, there could be sensor fusion algorithms if you have multiple sensors working together. Then, there has to be a controller, which is the brain of the ADAS logic, and then there is a vehicle dynamics piece that takes what the controller says and actuates the vehicle.

    Now, all of this is part of the Highway Lane Following test bench, available as a documented example in the MathWorks Automated Driving Toolbox. And on top of all of this, there is also a lane and vehicle detector, or vision detector, model, wherein you may have AI models doing object detection or lane detection for you within the same integrated framework.

    What happens when you try to leverage GPU Coder to generate code for an entire test bench, or for a piece of a test bench? To successfully translate more complex, real-world algorithms, we recommend this three-step workflow for using GPU Coder. The first step is to prepare your MATLAB algorithm for code generation: here you examine and modify your original MATLAB algorithm to introduce the implementation considerations that are needed for C, and use the MATLAB language features supported for code generation. The second step is to test whether the MATLAB code you just modified is ready for code generation using default settings. If successful, MATLAB Coder or GPU Coder will generate a MEX file for you. If not, you will have to iterate with the previous step to get your algorithm to a point where you can successfully generate a MEX function from your MATLAB algorithm.

    Once you have gotten past the second step, you can either generate the CUDA source code or keep the MEX function from the previous step. You can iterate over steps one and two to further modify your MATLAB code, to optimize the generated code for its look and feel, memory, speed, and performance, or to optimize the MEX function for performance itself.
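
    At the command line, these steps roughly correspond to the following sketch; the entry-point name and input size are placeholders, not the webinar's code:

    % Step 2: check code-generation readiness by building a MEX function
    cfg = coder.gpuConfig('mex');
    codegen -config cfg myDetector -args {ones(224,224,3,'single')}

    % Step 3: generate CUDA source as a library, using cuDNN for the network
    cfg = coder.gpuConfig('lib');
    cfg.TargetLang = 'C++';
    cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
    codegen -config cfg myDetector -args {ones(224,224,3,'single')}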

    Now, from that entire Highway Lane Following automated driving use case, let me pull out the perception unit, and in that perception unit we have a couple of neural networks. One is an AlexNet-based regression model, which is doing lane detection for me, and the other is a YOLO v2-based object detection model. To facilitate incorporating AI into your Simulink models, or AI inside MATLAB, for Simulink specifically we have provided a couple of blocks inside the Deep Learning Toolbox library in the Simulink library browser. You can use these blocks to incorporate artificial intelligence directly into your test bench or plant model. Also, for networks that are not supported directly by the image classification block, you can use MATLAB Function blocks to do code generation for object detection kinds of networks.

    Now, in this particular test bench we are getting the input directly from a camera on the Jetson board, doing some preprocessing, feeding the preprocessed video into the deep neural networks for lane detection and vehicle detection, getting the output, performing some post-processing, and projecting those bounding boxes and lanes back onto the world coordinate system or vehicle coordinate system. Now, if you look at the entire workflow, how would it go? First, you would want to create your entire test bench inside Simulink, and you may want to run your simulation on the desktop CPU.

    Once you are happy with the entire test bench and the simulation, you may want to scale the computation by leveraging the GPU on your desktop machine. Once that is done and you are happy with the outcome, you may want to generate the CUDA code and test that generated CUDA code on your host PC, which is that desktop GPU or CPU machine.

    Now, once you are happy with the overall generated code, its accuracy and performance, the next step would be taking that generated code and deploying it onto your embedded platform, which in this case could be a Jetson AGX Xavier board. Let's first go ahead and talk about simulating, or creating, a test bench inside Simulink. Here is the test bench that we were referring to. There are two deep neural networks, lane detection and vehicle detection. The traffic video input comes in, resizing happens, and the result is fed to the lane detection block, which is one of the blocks that is provided; you can just import the trained network there. There is also some post-processing happening to overlay the detected coordinates onto the vehicle coordinate system and push them back onto the video.

    Now, the traffic video also goes into an object detector, which is YOLO v2 in this case, and here we have implemented YOLO v2 with the help of a MATLAB Function block. What we are doing inside is calling the YOLO v2 network and doing inference, then passing those outputs to the annotation step to project them back onto the coordinate system. And this is how the output may look: on the left you see the input, where you are getting the feed directly from a CSI camera or a USB camera, and on the right you see the output overlaid onto the input video. You can run the whole simulation just by clicking the Run button on the toolstrip.

    Now, once you are happy with the test bench on the desktop machine, the next thing you may want to do is scale your computations onto NVIDIA GPUs, and for that you may want to first leverage an NVIDIA GPU to run the test bench, and then also generate CUDA code and test that CUDA code from within the platform itself. And while you are doing the code generation, or while you are testing the generated code on NVIDIA GPUs, specifically for the deep neural networks you may want to leverage optimized libraries like cuDNN and TensorRT, and for the preprocessing or post-processing units, you may want to leverage other CUDA-optimized libraries like cuBLAS, cuSolver, cuFFT, or Thrust.

    Now, last time what we did was just run the code on the CPU. Now, you can go into the Modeling tab, open the model settings, and start to choose a different simulation target: I can enable GPU acceleration and also choose which library I want to leverage, whether it is the cuDNN library or the TensorRT library. I can click OK, and once that is done, I can quickly go ahead and run the simulation model.

    Now, this particular simulation that I am running right now is without code generation on the GPU. So, from initially running the model on the CPU, I am now running the model with the GPU that is on my host PC. The next step is that you may want to go ahead and generate CUDA code and leverage that CUDA code for deployment and acceleration. I can go into the Embedded Coder settings, and inside the settings I will go to the code generation section. I can choose my system target file, say that I need to generate GPU code, and choose the toolchain that I want to leverage while building an executable or a library.

    Now, once that is set up, I can go to the interface section to specify which library I want to leverage while doing code generation, which could be either cuDNN or TensorRT. For the type of GPU code that I want to generate, I can choose whether I want to automatically generate calls to optimized libraries like cuBLAS, cuSolver, and cuFFT, or not, and once that is all done, I can just click on build, and it will automatically generate the CUDA code for me.

    Now, let me walk you through the generated CUDA code. The generated CUDA code is entirely open and modifiable. It starts with cudaMalloc calls, which do the memory allocation for the application. Then, it does cudaMemcpy calls to copy the data back and forth between the GPU and the CPU, and in between these memory copies, you will also see the different kernels that are generated, which call our deep neural networks for detection. You can also look at the deep neural network code that is generated, and specifically look at two functions, setup and predict.

    predict is for doing the inference, and the setup function, in particular, contains all the information about the deep neural network architecture itself. Let me just go ahead and click on setup: if you look inside the setup function, it calls all the different layers that are inside the deep neural network architecture, starting with the input layer, then the convolution layers, and so on, and the .bin files are where the weights are stored for the deep neural network architecture. So, the entire generated code is fully open and standalone, and you can integrate it into your larger application, or just do closed-loop simulation inside the Simulink model itself.

    Until now, everything we did was on the host PC, where we were generating the code and doing model-in-the-loop, or software-in-the-loop, testing. The next step would be to take the generated code onto my embedded platform, which could be a Jetson AGX Xavier board. By remaining inside the MATLAB framework, you can push your generated code onto the embedded platform as well, and to do that you again need to go to the settings; inside the settings you go to hardware implementation and choose which hardware platform you want. Is it a Jetson board, a Drive board, or another NVIDIA target? You also provide the device address, username, and password so that MATLAB is able to access the board. You can choose which action you want: do you just want to build, or do you also want to build and run? You can also choose whether you want to display the output on a monitor or not.

    Once all that is done, you can go into the hardware section and just build and deploy the application onto the embedded platform. And what you see here right now is the generated code running on the NVIDIA AGX Xavier board independently, launched directly from inside MATLAB. So, what we saw here was: we ran the code on the CPU, then we moved on to a desktop GPU and got roughly seven-times-faster inference, and then we generated the CUDA code and ran that generated CUDA code out of the box on the Jetson AGX Xavier platform. What we were looking at here was the Simulink test bench. If you are doing this entire development inside MATLAB, then the entire workflow for code generation is supported from MATLAB as well. You can leverage the GPU Coder app to replicate exactly the same process that we have done here, to do automatic code generation and deploy onto an embedded target, whether it is a Jetson Nano, a Drive board, or other Jetson targets.
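
    For the MATLAB-based route, here is a hedged sketch of connecting to the board and generating a deployable executable with the GPU Coder Support Package for NVIDIA GPUs; the address, credentials, and entry-point name are placeholders:

    hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');    % connect to the Jetson board

    cfg = coder.gpuConfig('exe');
    cfg.Hardware = coder.hardware('NVIDIA Jetson');
    cfg.Hardware.BuildDir = '~/remoteBuildDir';
    cfg.GenerateExampleMain = 'GenerateCodeAndCompile';    % generate an example main for the executable
    cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
    codegen -config cfg myDetector -args {ones(224,224,3,'single')}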

    Now, specifically for NVIDIA GPUs, beyond code generation, which by default produces single-precision code, you can further optimize and get higher performance from the generated code by using the Deep Network Quantizer app. Inside the Deep Network Quantizer app, you can import the network, bring in calibration data that is representative of the dynamic range of your inference data, quantize and validate the deep neural network architecture, and also visualize the dynamic ranges of the multiple layers inside the neural network. And once you are done with the quantization, you can directly take that quantized network for deployment on embedded targets.
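
    The same workflow is available programmatically; a minimal sketch, where the network and datastores are assumptions:

    quantObj = dlquantizer(net, 'ExecutionEnvironment', 'GPU');
    calResults = calibrate(quantObj, calibrationDatastore);   % collect dynamic ranges per layer
    valResults = validate(quantObj, validationDatastore);     % check accuracy of the quantized network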

    We also did some benchmarking of the accuracy and throughput after quantization, and what we found was that, for top-1 accuracy on an image classification kind of network, when we do the conversion from FP32 to INT8 precision, the accuracy drop is not significant. However, the throughput increases significantly when we are processing large batch sizes. For a single batch the performance improvement is not that large, but for large batch sizes the performance improvement can be significant, and in many of these use cases, we have also observed that the memory footprint of the neural network architecture can be reduced significantly, in some cases by 50% to 60%, as well.

    Now, in the example we just saw, I showed you one way to access the hardware, and there are actually three ways in which you can work with the hardware. The first is that, while you are developing your algorithm, you can access the peripherals from MATLAB. So, you can connect MATLAB to the hardware board, and if you need to get real data from a webcam connected to the board, you can bring that into MATLAB to test the algorithm you are working on. The second use case is what we just saw in the previous example: we generated the code, compiled it, and deployed it onto the hardware of your choice. You can then disconnect MATLAB from the hardware, and it will operate in standalone mode.
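
    For the first mode, a small sketch of peripheral access using the support package; the address, credentials, and camera name are placeholders:

    hwobj = jetson('192.168.1.15', 'ubuntu', 'ubuntu');
    cam = camera(hwobj, 'webcam_name', [640 480]);   % camera name as reported by getCameraList(hwobj)
    img = snapshot(cam);                             % grab a frame into MATLAB
    imshow(img)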

    Lastly, you can also run processor-in-the-loop testing, and in this case, you first generate code for the algorithm that you want to test and get it running on the target processor. You then send test vectors from MATLAB to the target, which computes the result, and that output is sent back to MATLAB, where you can compare it with results from MATLAB and see if there are any differences. You can also connect to external peripherals like webcams, or you can bring in data over multiple different protocols like TCP/IP or UDP.
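
    As I understand the documented workflow, processor-in-the-loop execution is enabled through the code generation configuration; a hedged sketch, assuming a Jetson target and the same placeholder entry point as before:

    cfg = coder.gpuConfig('lib');
    cfg.VerificationMode = 'PIL';                      % run the generated code on the target
    cfg.Hardware = coder.hardware('NVIDIA Jetson');
    codegen -config cfg myDetector -args {ones(224,224,3,'single')}
    % Calling the generated myDetector_pil function then executes on the board
    % and returns the results to MATLAB for comparison.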

    There are Simulink blocks available inside the GPU Coder support package for NVIDIA hardware that allow you to communicate over the network and also bring in video data directly from the CSI camera on board the NVIDIA hardware, or connect a camera over USB and bring in the feed.

    Now, beyond NVIDIA GPUs, for deployment MATLAB has a unique code generation framework that allows models developed in MATLAB to be deployed anywhere without having to rewrite the original model, and this gives you the ability to test and deploy the entire system, not just the model itself. MATLAB can generate native, optimized code for multiple frameworks. It gives you the flexibility to deploy to lightweight, low-power embedded devices, such as those used in cars or other embedded platforms; low-cost rapid prototyping boards, such as the Raspberry Pi, which is an ARM target; or edge-based IoT applications, such as a sensor and a controller on a machine in a factory. You can also do code generation for specific FPGAs like the ZCU102 or ZCU106, or Intel SoCs. Also, if you are interested in taking the generated code and applying it onto any other platform, we can help you generate target-agnostic C/C++ code, which can then be taken to any platform of your choice.

    Now, we talked about the entire workflow for generating the code, but when we are talking about C/C++ or CUDA code, one thing that is very important is how good the generated code's performance is. At MathWorks, we are dedicated to improving the performance of deep learning in MATLAB, and we have a dedicated process in place to measure, track, and act on performance. This enables us to focus our development efforts in the right areas. The following slide highlights the result of these efforts. A good example of this is MATLAB performance on the CPU, where inference performance with the extended framework dlnetwork, which we talked about in the last discussion, became 70% faster in just one release.

    Now, deep learning inference with MATLAB is fast. On the CPU, MATLAB outperforms both TensorFlow and PyTorch; this is inference done on an Intel Xeon CPU with TensorFlow version 2.4.1 and PyTorch 1.6.0 for ResNet-50, which is a classification network. This performance increases further when we start using the coders, like MATLAB Coder, which generates native C/C++ code, and you can see the generated code gives much faster inference on CPUs.

    Now, coming to the GPU: MATLAB's performance on the GPU is comparable to that of PyTorch and TensorFlow for processing large data sets. When doing inference on smaller data sets, there are options to improve performance. The first is MEX acceleration, which, as you can see, greatly improves performance. GPU MEX acceleration generates optimized C/C++ CUDA code for deep learning and brings added performance.

    GPU MEX acceleration is available as a hardware support package when using predict, and it does not require an additional GPU Coder license; it gives you significant acceleration for inference when you have a GPU on board. The second option is going through GPU Coder. GPU Coder can achieve this performance because it generates native, optimized CUDA code, as opposed to requiring a runtime on the target hardware, and in the case of PyTorch and TensorFlow, there can be the overhead of Python running on the target.

    Since we are talking about benchmarks, I'll also say a little about training benchmarks; training in MATLAB is comparable to other frameworks. Here you can see that the time for an average epoch, that is, the time to pass the entire data set through a training iteration, is similar to that of TensorFlow. Now, before going any further, I would request you to please go to the polling window and help me with another question.

    The first question that I have is: which embedded target do you want to deploy your algorithm on? Is it NVIDIA GPUs, ARM CPUs, or ARM GPUs? Is it an Intel CPU, or Intel integrated graphics? Is it FPGAs, or any other target that you are working on? The second question is: would you like to have a follow-up technical discussion about your application, or use case, with our technical experts? I will pause here for a few seconds to give you enough time to answer these questions and help us connect back with you with more focused content.

    Thanks to all who have answered the questions. Please continue to answer the questions in the polling window if you have not yet done so. Now, going ahead: how can MathWorks support you in your deep learning journey? The first thing that I want to mention is that MathWorks has been recognized as a leader by Gartner, for two consecutive years now, for data science and machine learning platforms. And the reason for that is the completeness of vision of the solutions that we provide for artificial intelligence.

    We can help you in the complete workflow for artificial intelligence applications, starting from data preparation and data labeling, through modeling of your AI models and architectures, integrating them with your bigger systems and plant models, and also helping you with production code generation, whether you want to deploy your applications onto embedded devices, edge devices, enterprise systems, or the cloud. There are also a lot of courses available to you, for free, to help you get started quickly with deep learning in MATLAB.

    If you are new to MATLAB or Simulink, there are online courses for MATLAB and Simulink, and if you are working specifically on deep learning or machine learning, there are specific introductory courses on deep learning and machine learning. On top of that, if you want to go further and learn more about deep learning in MATLAB, there are more dedicated two-day courses, very exhaustive, delivered by our experts, on deep learning with MATLAB and machine learning with MATLAB.

    Also, there are courses on core MATLAB, around MATLAB fundamentals, data processing, visualization, image processing, and computational mathematics. MathWorks can also come on board via our consulting services, and the philosophy behind consulting is to enable you to maximize the ROI on our tools: by reducing the learning curve, by helping you implement your objectives and proposals more quickly, and by helping you avoid common mistakes that you may make when getting started. Our consulting is open and transparent, and we help you with knowledge transfer of all the development that we have done with you and for you, so that in the future you are independent and on your own for further development.

    This concludes our fourth session in the deep learning webinar series. We will connect back with you about the hands-on deep learning virtual workshop, which, as we have mentioned, is by invite only. So, stay tuned and look forward to the hands-on deep learning workshop. Until then, enjoy deep learning, see you in the hands-on session, and thank you all for joining the deep learning webinar series.