
Fitting AI Models for Embedded Deployment

AI is no longer limited to powerful computing environments such as GPUs or high-end CPUs, and is often integrated into systems with limited resources, such as patient monitors, diagnostic systems in vehicles, and manufacturing equipment. Fitting AI onto hardware with limited memory and power requires deliberate trade-offs between model size, accuracy, inference speed, and power consumption, and that process is still challenging in many frameworks for AI development.

Optimizing AI models for limited hardware generally proceeds in these three steps:

  • Model Selection: Identify less complex models and neural networks that still achieve the required accuracy
  • Size Reduction: Tune the hyperparameters to generate a more compact model or prune the neural network
  • Quantization: Further reduce size by quantizing model parameters

Additionally, especially for signal and text problems, feature extraction and selection result in more compact models. This talk demonstrates model compression techniques in MATLAB® and Simulink® by fitting a machine learning model and pruning a convolutional network for an intelligent hearing aid.

Published: 25 May 2022

Welcome to the session on Fitting AI for Embedded Deployment. My name is Bernhard Suhm, and I'm the Product Marketing Manager for the Statistics and Machine Learning Toolbox, working with both you, our customers, and development, to make it easier to bring AI into embedded applications.

And my name is Emelie and I'm an Application Engineer at MathWorks focusing on our AI tools and workflows.

Emelie, didn't you mention some recent customer interaction that you wanted to share with me?

That's right. So a customer just left me a voicemail. And let me play that for you.

Hey, MathWorks. I need some help. As I mentioned in our last meeting, we are looking at adding intelligent alerts on our agricultural machines. And now my colleagues from the Data Science Department have given me new AI models for anomaly detection. They want me to put these AI models on the hardware platform but they are way too large. I mean, models are great. Very impressive accuracy and all. But I just don't see how we will be able to deploy them without making any big changes to our hardware setup. And changing the setup is just not possible. We spent a lot of time certifying the control units in the setup they are right now. So it is not on the roadmap to do any changes to that. Can we talk later this week?

So this call was very timely, because in this talk, you'll learn a process for compressing the size of your AI models so that they do fit on resource-limited hardware like this customer has.

That's actually a story we hear from many industries, Emelie. And there is a range of hardware, depending on the industry, in use for deploying AI: microcontrollers in heavy machinery and cars, PLCs on manufacturing floors, FPGAs in wireless base stations and routers, and low-power chips in wearable devices and hearing aids. And you'll hear a lot more about that last one, since we chose it for our demo. In all these applications, embedded AI is the only option.

But let's take a step back and see where edge or embedded comes into play in the hardware landscape. So on the left-hand side, at the very high end considering memory footprint and compute speed, cloud servers offer many gigabytes of memory and teraflops of compute. Going down the arrow here, we see desktop CPUs and GPUs, which still provide a couple of gigabytes and gigaflops. At the bottom, we have embedded hardware, which may have as little as one kilobyte of on-chip memory and just be in the megaflops range for compute.

Today's talk focuses on the low end of the spectrum, and it's structured as follows. Next, we'll explain what edge or embedded AI means and what its challenges are. We also describe a high-level workflow for model compression that applies to both machine learning and deep neural nets. After that, the two main sections of this talk provide more details on what model compression entails and the tools available in MATLAB and Simulink. We'll do so for machine learning first and then deep learning, in the context of building an intelligent hearing aid. Emelie, I think some in the audience might wonder what edge AI is. Can you explain?

Well, just as in the case of the customer who called me, it's typical that it's a data scientist who develops the models. And in many use cases, they can put the models in the cloud where memory and compute are almost limitless. But with edge AI, the models live on limited hardware where computation has to happen locally. Sometimes the embedded software engineers are given these models to implement and they'll quickly see that the models won't fit. So they go back to the data scientists and inform them about that. And they'll have to decide who owns making the AI model smaller. This ownership differs from company to company, but it's definitely helpful to bridge these two worlds.

As the user you talked to, Emelie, was mentioning, adding hardware just to accommodate large AI models is highly undesirable because it adds cost in mass production and may delay time to market for things like recertifying hardware. So the challenges of fitting AI are these two: first, building performant models that are small, and second, knowing how to do it. Next, Emelie will introduce a structured way of developing an AI model with hardware constraints in mind.

All right. So let me take you through the workflow. It starts with fully understanding the limitations of the hardware the AI model will be deployed to. This includes understanding the constraints on memory and compute power. The step after that is to select a model that fits the complexity of the problem but has the smallest size. Once the model is selected, you simplify that model by reducing features, tuning the size-related hyperparameters, and removing weights if you have a neural network.

It's an iterative process. So you'll go through this workflow for many different models to find which one is smallest with the highest accuracy. When the model is as simplified as it can get, you simplify the representation of the parameters to a fixed-point format, commonly known as quantization. Now you should be ready to deploy and integrate the model on the limited hardware. We'll dig more deeply into these three steps in the talk. And now I'll let Bernhard show you the compression steps for classical machine learning.

Great. It is well known that you can achieve higher accuracy with more complex models. While classic machine learning models don't get as huge as deep neural nets, even classic machine learning can become too big for resource-constrained hardware. Decision trees and linear models tend to be small, whereas kernel SVMs and ensembles can grow rather big, since their size scales with the number of learners in the ensemble or the number of support vectors. Not quite as huge as deep neural nets, but still. Shallow neural nets with just a few layers, though, can be quite accurate while very compact in size. So they are an attractive alternative.

So unless your data allows you to get away with one of these simple models, you'll have to work on making a more complex model fit on your hardware. That's why our size gauge here is still in the red at this point. So let's look at what the steps mean. What tools does MATLAB have for model compression? The first step is size-aware selection of your initial models, and you can do that interactively within our Learner apps, our visual model-building environment.

To simplify the model, the second step, you can perform automated feature selection, also inside our Learner apps, while more advanced model size optimization needs to be performed on the command line. Our demo will apply Bayesian optimization to optimize accuracy while observing a size constraint. Finally, the third step, quantizing model parameters, takes the representation from 64-bit double to single, half, or even fixed-point.

In MATLAB, Fixed-Point Designer allows you to do so interactively. Or, if you integrated your model into Simulink using our native machine learning blocks, you can do it with a simple mouse click. So hopefully now, the size gauge is in the green and you're ready to deploy. Let's see this in action in a demo. But first, let's understand what we are doing. We are building a model to classify the acoustic environment for an intelligent hearing aid. Consider the two environments shown. Would you want your hearing aid to pick up any sounds in both? Clearly, in the conversation with colleagues, you'd rather just hear them, while sounds from behind distract you, whereas in the forest, you want to hear everything around you.

To empower the hearing aid to adjust its behavior accordingly, we'll build a model that classifies the acoustic environment as one where you listen to just what's in front of you versus one where you want to listen all around. And the hearing aid can switch its mode of operation accordingly. Chips on hearing aids are quite small, in some cases with less than 1 kilobyte of memory. For our demo, we'll target 50 kilobytes. The data set we are using distinguishes 15 types of scenes, including forest and cafe, but also beach, office, and others, and contains up to 300 samples each. So next, we'll go into the actual demo.

To get us started, we already loaded the data: 1,200 samples in the training set. With machine learning, you can get away with fairly limited data, in contrast to deep learning. Before we dive into the specifics, let's look at what compression accomplishes on this task. This chart shows the average size across all models after the various stages in the model compression workflow. Quite an impressive reduction. Now let's see how we got there. To get a sense of the data, we can interactively apply k-means with a clustering Live Task. It turns out the optimal number of clusters among the 15 scenes is two. So that provides support for reframing the task as a two-class problem.
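
Programmatically, the same check could look like this minimal sketch (assuming the training features are in a matrix XTrain; this is an illustration, not the exact demo code):

    % Evaluate k-means solutions for 2 to 15 clusters using the silhouette criterion
    eva = evalclusters(XTrain, 'kmeans', 'silhouette', 'KList', 2:15);
    optimalK = eva.OptimalK;          % comes out as 2 on this data
    idx = kmeans(XTrain, optimalK);   % cluster assignment for each training sample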

To save time, I'm skipping training the initial model distinguishing 15 scenes. Instead, I loaded a session where I previously trained a bunch of two-class models. Let's try a linear support vector machine. It lands in the middle of the pack, while the leaders are currently a linear and a k-nearest neighbor model, and the tree-based models have significantly lower accuracy. We are at step one of the model compression workflow now: select initial models, being aware of their size. To determine the size of these initial models, we export them and look up the space that the parameters occupy. It turns out the linear SVM, at more than 500 kilobytes, is in the middle of the pack. But that's still 10 times larger than our target hardware can afford.
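
Looking up the space a model occupies can be done with whos after exporting it from the Learner app; a small sketch with an assumed variable name:

    % Export the trained model from the app to the workspace first, then:
    info = whos('trainedSVM');
    fprintf('Model size: %.0f KB\n', info.bytes / 1024);   % the linear SVM came in above 500 KB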

So let's get to work making it smaller. Next we proceed to step two of the workflow, simplify the model. First, we reduce the overall number of model parameters by selecting a subset of features. The next step is tuning hyperparameters for model size. You could tinker with size-relevant hyperparameters manually. Instead, we'll apply Bayesian optimization with a size constraint using the bayesopt function, which needs an objective function, and we call that optimizeSVM for this example.
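
A minimal sketch of what such a size-constrained optimization could look like, assuming training data XTrain and YTrain and an assumed budget of 500 support vectors (an illustration, not the exact demo code):

    % Hyperparameters to tune, searched on a log scale
    vars = [optimizableVariable('box',  [1e-3, 1e3], 'Transform', 'log'), ...
            optimizableVariable('scale',[1e-3, 1e3], 'Transform', 'log')];

    % Minimize cross-validated loss subject to one coupled size constraint
    results = bayesopt(@(t) optimizeSVM(t, XTrain, YTrain, 500), vars, ...
        'NumCoupledConstraints', 1, 'MaxObjectiveEvaluations', 30);

    function [loss, sizeConstraint] = optimizeSVM(t, X, Y, maxSupportVectors)
        % Train a Gaussian-kernel SVM with the candidate hyperparameters
        mdl = fitcsvm(X, Y, 'KernelFunction', 'gaussian', ...
            'BoxConstraint', t.box, 'KernelScale', t.scale);
        % Objective: 5-fold cross-validated misclassification loss
        loss = kfoldLoss(crossval(mdl, 'KFold', 5));
        % Coupled constraint: feasible (negative) when the number of support
        % vectors, our proxy for model size, stays below the budget
        sizeConstraint = size(mdl.SupportVectors, 1) - maxSupportVectors;
    end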

Given data and specific hyperparameters, this function trains a kernel support vector machine. And we use the number of support vectors as an estimate of its size. The optimization does its job, as you can see in these charts, and we end up with a model that does meet our size target and is very accurate. That's great. Now we can integrate this model into the larger system, alongside the other parts of the intelligent hearing aid. Simulink enables you to represent different components of a complex system and generate production-ready code that runs on target hardware.

For this demo, our Simulink model only feeds pre-recorded test audio, via feature extraction, to the classifier, which is represented by this native machine learning Simulink block. Other popular types of machine learning models are supported as well. As the third and final step in the model compression workflow, we apply parameter quantization. Open the machine learning block, and you can switch the data type from double to lower precision, like single, with a simple button click. Let's verify that this quantized model still classifies our test sample correctly. And it does. Audio from a cafe is still classified such that the hearing aid listens directionally in front of you. Of course, for your production model, you will want to verify accuracy more systematically on an actual test set.
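
The feature extraction step itself isn't detailed in the talk; as a rough illustration, the MATLAB-side equivalent could be built with Audio Toolbox along these lines (fs, audioIn, and the chosen features are assumptions, not the demo's actual configuration):

    % Hedged sketch: compute per-frame audio features and pool them into one
    % feature vector per recording
    afe = audioFeatureExtractor('SampleRate', fs, ...
        'mfcc', true, 'spectralCentroid', true, 'spectralRolloffPoint', true);
    frameFeatures = extract(afe, audioIn);   % one row per analysis frame
    featureVector = mean(frameFeatures, 1);  % summarize the whole recording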

Let's revisit the impact of model compression. Applying all four steps on this task reduced model size by over 95%, or a factor of 20, and we met the aggressive target even with the average across models, whereas the best model was significantly smaller, at less than 10 kilobytes. Regarding accuracy, you will typically incur a modest loss with each compression method. On this task, we lost little accuracy overall because task simplification significantly increased accuracy. You can find the details in our handout. Next, Emelie will walk you through compressing deep neural networks.

Yeah, so let's start with model selection. AI research has had the tendency to not care too much about the size of the models being developed, and they can therefore be quite bloated. In this chart, we have accuracy on the y-axis and the relative prediction time on the x-axis. The size of the bubbles represents the memory footprint of that specific model. So as we can see here in the chart, a network can have the same accuracy as another network while being both faster and having a smaller footprint.

Just look at MobileNet-v2 and VGG-16 here. What we see is that a smaller, more efficient architecture can be just as performant as a larger architecture. The size gauge Bernhard introduced is still pointing at red for deep learning. So let's look at the other steps on how to make a network smaller. Step two in compressing deep neural networks is pruning. Pruning is a structured process where we evaluate the importance of the weights in a trained network to find the weights that are less important, so that we can completely remove them. For convolutional neural networks, we remove entire filters, which you can think of as groups of weights.

After the pruning process is done, we retrain the new pruned model with the data set that was used to train the original network. The pruned and retrained network will have a smaller memory footprint but also faster inference, because there will be fewer mathematical operations to perform. So after pruning, our gauge is now closer to pointing at small. For the third step for deep learning, we can use the Deep Network Quantizer app. And after quantization, we should have a small enough model for deployment.

So let's move on to the deep learning demo. We'll take the same data set as Bernhard showed, but instead of having a binary problem, we'll classify 10 different classes. So we have a more difficult problem than before, and that calls for a more complex model, which is why we will be choosing deep learning. For the three steps that I explained earlier, I will in this demo show you how to select a model in the Deep Network Designer app, how to use Taylor pruning on that model to be clever about how we remove the less important filters, and finally, how to quantize the network from 32-bit to 8-bit. Now let's dive into the demo.

We'll start with step one, selecting a model. The data that we're going to use is spectrogram data, so that we can play around with different models that are made for images. I'm going to open up an interactive tool called Deep Network Designer. We can see here that we have a couple of pretrained networks, but I'm going to import our own network, called trainedNet. Inside Deep Network Designer, we can see the structure of the network. We can see that it starts with an image input layer. We can also see that it's a classification problem and that we have 34 layers in our network.
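
Programmatically, opening the app with the network and inspecting it might look like this (the file and variable names are assumptions):

    load('trainedNet.mat', 'trainedNet');  % pretrained 34-layer CNN (assumed file name)
    deepNetworkDesigner(trainedNet)        % inspect the architecture interactively
    analyzeNetwork(trainedNet)             % layer-by-layer sizes and learnable parameters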

Now let's look at the accuracy of this initial model. We're going to use that accuracy as a baseline when we later prune and quantize the network. The accuracy is at 82.7%. We can quickly look at the confusion chart to see if there are any classes that we have challenges with. Beach and home seem to be the classes that are challenging. For example, misclassifying home as office, which might have been correct the last couple of years, but I don't believe that's represented in this data.
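
Computing that baseline follows the usual evaluation pattern; a sketch, assuming the test spectrograms are in an imageDatastore called imdsTest and trainedNet is a standard classification network:

    % Baseline evaluation before pruning and quantization
    YPred = classify(trainedNet, imdsTest);
    accuracy = mean(YPred == imdsTest.Labels)   % about 0.827 in the demo
    confusionchart(imdsTest.Labels, YPred)      % spot problem classes such as beach and home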

And now let's go into step two, pruning of the network. We're going to configure some pruning options, set up the Taylor prunable network, and then start the pruning loop. We're going to fast-forward here in the beginning, because in the first few steps no filters are pruned yet. But then, when the pruning gets started, we can look at the blue line, which is the accuracy, and see that as we prune more and more filters in the convolutional neural network, the accuracy actually goes down.
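
For orientation, here is a heavily condensed sketch of such a Taylor pruning loop with Deep Learning Toolbox, assuming trainedNet is already a dlnetwork ending in softmax, mbqTrain is a minibatchqueue over the training spectrograms, and the loop bound and MaxToPrune value are made-up numbers; the full workflow also includes fine-tuning, learn-rate handling, and stopping criteria:

    % Wrap the trained network so filter importance (Taylor scores) can be tracked
    prunableNet = taylorPrunableNetwork(trainedNet);
    maxToPrune = 8;                                   % filters removed per iteration (assumed)

    for iteration = 1:numPruningIterations            % assumed loop bound
        [X, T] = next(mbqTrain);                      % mini-batch of spectrograms and targets
        [loss, pruningActivations, pruningGradients] = ...
            dlfeval(@modelLossPruning, prunableNet, X, T);
        prunableNet = updateScore(prunableNet, pruningActivations, pruningGradients);
        prunableNet = updatePrunables(prunableNet, 'MaxToPrune', maxToPrune);
    end
    prunedNet = dlnetwork(prunableNet);               % convert back for retraining

    function [loss, pruningActivations, pruningGradients] = modelLossPruning(net, X, T)
        [Y, ~, pruningActivations] = forward(net, X); % third output: activations of prunable filters
        loss = crossentropy(Y, T);
        pruningGradients = dlgradient(loss, pruningActivations);
    end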

When the pruning is done, we retrain the network with the entire data set to regain some of the accuracy that we lost. Let's visualize how many filters we actually removed. You can see here that convolutional layers one, three, and seven had the largest removal of filters, or the heaviest pruning, you could say. Now let's look at the accuracy that we got from retraining the pruned model. We can see that the accuracy is now at 81.3%, instead of the 82.7% we had before. We can also look at the confusion matrix again, because pruning can unevenly affect the classes and introduce a bias.

So let's see if there are any classes that have taken a toll due to the pruning. It seems like beach and home are still challenging, but cafe/restaurant is the class that has decreased the most. And we're actually starting to misclassify it as grocery store, which is something that we would need to look at more closely and try to remedy if we were to use this model. In the network metrics, we can take a closer look at how much we ended up pruning the network and its effect on memory size.

I mean, the results are very promising. Just look at that large decrease in the number of learnable parameters and the consequent decrease in memory. We started with a memory footprint of 4.5 megabytes for the original network, and after the pruning, that number is at 3.6 megabytes. So all in all, the network memory decreased by 20%, whilst the accuracy only decreased by 1.8%. That's impressive. And now let's take this pruned network and move it into step three, which is quantizing it. I'm going to open up another interactive tool called Deep Network Quantizer. I'm going to select the pruned network that we have in the MATLAB workspace.

We have here selected the data store that we want to use for calibration. And the app displays the computed statistics, such as min and max values, for the weights, the biases, and the activations of each layer in the network. If we look at the validation results, we can see that the memory taken by the learnable parameters decreased by 75%, from the pruned size we started with to the pruned and quantized size, whilst the accuracy actually slightly increased. This leaves us with a total accuracy decrease of only 1.5% for the whole compression. We can end by generating code for the neural network. And Bernhard will talk more about this later.
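
The Deep Network Quantizer app has a programmatic counterpart in the dlquantizer workflow; a sketch, assuming the pruned network and the calibration and validation datastores are already in the workspace:

    % Quantize learnable parameters and activations to int8
    quantObj   = dlquantizer(prunedNet, 'ExecutionEnvironment', 'GPU');  % or 'CPU'/'FPGA'
    calResults = calibrate(quantObj, calibrationDS);   % collect min/max ranges per layer
    valResults = validate(quantObj, validationDS);     % accuracy and memory after quantization
    valResults.MetricResults                           % compare against the floating-point baseline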

So let's recap the final results from the compression. The memory footprint went from 4.48 megabytes with the initial model all the way down to 0.89 megabytes with the pruned and quantized model. That is a size reduction by a factor of five, and all of this happened with an accuracy loss of only 1.5%.

In closing, let's discuss how you convert high-level AI code to the low-level code that can run on your hardware targets. With MathWorks, code generation bridges that gap. You only need one code base in MATLAB or Simulink, and code generation automatically translates that to low-level, production-ready code that can execute directly on your hardware. Here you see the top vendors for each class of hardware, from embedded GPUs on the high end for autonomous vehicles, down to microcontrollers for low-power embedded devices.
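
For the machine learning model from the first demo, that step boils down to saving the trained model for code generation, wrapping prediction in an entry-point function, and calling codegen; a sketch with assumed names (compactMdl, sceneSVM, predictScene, numFeatures):

    % Save the compact trained model so the generated C code can load it
    saveLearnerForCoder(compactMdl, 'sceneSVM');

    % Entry-point function, in its own file predictScene.m
    function label = predictScene(features)  %#codegen
        mdl = loadLearnerForCoder('sceneSVM');
        label = predict(mdl, features);
    end

    % Generate a C static library for the entry point
    codegen predictScene -args {zeros(1, numFeatures)} -config:lib
    % (for the deep network, codegen with a coder.DeepLearningConfig target is the analogous step)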

So what are the conclusions of this talk? The main conclusion is that you can indeed fit many types of AI onto resource-constrained hardware, be it limited in memory, power, or inference speed. The demos illustrated that we have interactive tools that can make the compression process easier. And after that process, the model should fit just fine. The workflow is similar for both machine learning and deep learning; just the tools you need are different. In the poll we are about to open, you can share which constraint is most challenging for you.

As we are now transitioning to Q&A, here are a couple of examples of how our customers have deployed intelligent systems to the edge, and some links to documentation and videos to help you get started. You can find these in our handout. So if you remember one thing from this presentation, we want it to be this: smaller AI models can be better than large ones. Thank you for your attention.
