Understanding and Verifying Your AI Models
Neural networks can obtain state-of-the-art performance in a wide variety of tasks, including image classification, object detection, speech recognition, and machine translation. Due to this impressive performance, there has been a desire to utilize neural networks for applications in industries with safety-critical components such as aerospace, automotive, and medical. While these industries have established processes for verifying and validating traditional software, it is often unclear how to verify the reliability of neural networks. In this talk, we explore a comprehensive workflow for verifying and validating AI models. Using an image classification example, we will discuss explainability methods for understanding the inner workings of neural networks. Learn how Deep Learning Toolbox™ Verification Library enables you to formally verify the robustness properties of networks and determine whether the data your model is seeing during inference time is out of distribution. By thoroughly testing the requirements of the AI component, you can ensure that the AI model is fit for purpose in applications where reliability is of the utmost importance.
Published: 5 May 2023
Hello, everyone, and welcome to this session titled Understanding and Verifying your AI Models. Verification and validation are essential processes that help ensure the safety, reliability, and effectiveness of complex systems across a wide range of industries. As AI continues to revolutionize the way we work and live, it is critical to apply V&V techniques to these systems to ensure their accuracy and trustworthiness.
My name is Lucas Garcia and I'm a product manager for deep learning at MathWorks. In this session, we'll focus on specific techniques that can help us build a better understanding of the AI models, verify them, and ensure that they meet the intended design and functional requirements. But before we get started, share your MATLAB EXPO experience on social media using the hashtag #MATLABEXPO. We want to hear your thoughts and insights about the event.
And of course, if you are into AI, here are my social media handles, so let's connect and share our knowledge. Today I'll be talking about an adaptation of the classic V-shaped development process to AI applications, the W-shaped development cycle, and how MathWorks has capabilities addressing each area of the diagram.
I'll introduce Deep Learning Toolbox Verification Library, a new library we released to verify and test the robustness of deep learning networks. And finally, many of you have been using our products for certification of traditional software according to industry standards. In addition, MathWorks actively contributes to initiatives focused on AI-based certification, such as the working group on AI in aviation established by EUROCAE and SAE.
Now, we all know that AI provides the best results for many tasks. However, as AI models have become more prevalent in production environments, there is a growing desire to explain, verify, and validate their behavior. This is particularly important in safety-critical industries such as health care, aerospace, or automotive, where incorrect or biased decisions can have severe consequences.
Naturally, verifying and validating AI-enabled systems comes with a wide set of challenges, including data management, traceability, robustness, requirements, and much more. This is just a representative sample of some of the challenges that may arise from having AI components in your system. Over the last couple of years, there has been significant progress across industries on verifying AI-enabled systems through white papers, standards, and planning.
In the automotive industry, there is a new work-in-progress ISO Publicly Available Specification, ISO/PAS 8800, on road vehicles, safety, and artificial intelligence. In aerospace, the joint working group from EUROCAE and SAE is expected to release the new process standard AS6983 in 2024, and in medical devices, the FDA released its first AI/ML-based Software as a Medical Device Action Plan. One of the commonalities across industries is that traditional processes and workflows will have to be adapted as AI and ML are introduced.
As an example, this is the adaptation of the classical V-shaped cycle to AI applications, credited to EASA, the European Union Aviation Safety Agency, and Daedalean. Part of the effort done by this group was to identify how much of the existing processes around design assurance had to change and what could stay in place. And one of the identified areas is around having to assure-- that is, provide certainty on-- the emergent properties that come out of your neural network.
Instead of looking at the lines of code that are being written, you have to look at your learning algorithm and, more importantly, at the data used to train these machine learning models. So it is key how you treat your data, so that the performance you measure in the lab will hold when you go out to the field.
So in this session, we'll be showing how MathWorks tools can be used throughout this W-shaped workflow. Naturally, the W-shaped development process can coexist and run concurrently with the V cycle that is frequently used for development assurance of non-AI components. The task we'd like to solve is to verify a neural network that performs image classification.
Examples of this type of task across some of the key industries include classifying traffic signs, such as stop signs, in the automotive industry, predicting whether the aerial image we are seeing shows an airport or not, or classifying X-ray images to facilitate medical diagnosis.
Today we'll be verifying a deep learning model that identifies whether a patient is suffering from pneumonia or not by examining chest X-ray images. The model needs not only to be extremely accurate but also extremely robust since people's lives are at stake. However, it's worth noting that the techniques, workflows, and best practices we'll be discussing for this example are also applicable to the other examples we've highlighted here.
I'll very quickly introduce the data set we'll be using, which is the MedMNIST data set. If you are familiar with MNIST for digit classification, MedMNIST is a collection of prelabeled, lightweight 28-by-28 2D and 3D biomedical images designed for image classification. We decided to use this data set because of its simplicity and the ability to rapidly iterate over the design, given how lightweight it is. And within the MedMNIST collection, we'll use the PneumoniaMNIST data set.
OK, so we'll start with the very first step in the W-shaped development cycle that is AI- and ML-specific, and that is collecting the requirements for the machine learning component. Requirements Toolbox lets you author, link, and validate requirements within MATLAB or Simulink, and you can create requirements using rich text and custom attributes or import requirements from requirements management tools.
And as you can see in the Requirements Editor here, I've already collected a few requirements that are related to the input and output data, accuracy, robustness, latency, and implementation. Note that for each requirement you may also add a description that better explains what that specific requirement intends to accomplish.
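As a rough sketch, authoring one of these requirements programmatically with the Requirements Toolbox API could look like the following; the set name, ID, and wording are illustrative, not the actual requirements from the demo.

    % Create a requirement set and add one illustrative requirement
    rs = slreq.new("PneumoniaModelRequirements");
    add(rs, "Id", "REQ-01", ...
        "Summary", "Input shall be a 28-by-28 grayscale chest X-ray image", ...
        "Description", "Pixel values are expected to be scaled to [0,1].");
    save(rs);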
The next step is data management, and for this data set, the images have already been labeled. However, note that if you need to label your own data, MathWorks has various labeling apps, including Image Labeler and Signal Labeler. Additionally, the data has already been partitioned into training, validation, and testing sets, so we don't need to worry about that either.
In reality, all we need to do here is find a convenient way to adequately manage our images. Datastores, and in particular the image datastore here, allow us to manage a collection of image files where each individual image fits in memory but the entire collection does not necessarily fit. Granted, the MedMNIST images are small and would all fit in memory, but this approach lets us see how the process will scale to more realistic workflows.
By indicating the folder structure and that the label source can be inferred from the folder names, we're able to very simply create a MATLAB object that acts as a repository of data. Note that this data set is imbalanced toward pneumonia samples, so this should be considered as we design the model.
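A minimal sketch of that setup, assuming the images live in one subfolder per class (the folder path below is hypothetical):

    % Manage the image collection; labels are inferred from folder names
    imds = imageDatastore("pneumoniamnist/train", ...
        "IncludeSubfolders", true, "LabelSource", "foldernames");
    countEachLabel(imds)   % exposes the imbalance toward pneumonia samples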
Moving on to the learning process management step here, we'd like to account for all the preparatory work prior to the training phase. So we'll focus on coming up with the right network architecture, training algorithm, loss function, hyperparameters, and so on. Using Deep Network Designer, we can accomplish these tasks interactively, helping us to easily iterate over the design.
As you can see here, you can start off from one of the available pretrained networks, or you can choose to pick an existing network in the MATLAB workspace or create one from scratch, which is what we'll do in this case. Now from the layer library on the left, we can simply drag and drop the relevant layers onto the canvas to design our network. We'll choose an image input layer and a few convolution layers.
Let's speed up the process a bit here, but basically each conv block will consist of convolution, batch normalization, and ReLU layers, followed by dropout and pooling layers, and then a fully connected layer and a softmax layer that computes the class probabilities for classification. And once the network is assembled, we can go through the properties of each layer in the network and modify them accordingly, based on the available information we have about the data and the problem we're trying to solve.
And after going through this process, Deep Network Designer allows us to analyze the resulting network architecture and understand it in terms of its activations and learnables. Note that we have over 400,000 learnables and we have no warnings or errors. And finally, we can go to the Export button, export the network to the MATLAB workspace, or generate the equivalent MATLAB code that we require to train the model.
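The exported code amounts to a layer array along these lines; the filter counts and layer sizes here are assumptions for illustration, not the exact exported design.

    layers = [
        imageInputLayer([28 28 1])
        convolution2dLayer(3, 16, "Padding", "same")
        batchNormalizationLayer
        reluLayer
        dropoutLayer(0.2)
        maxPooling2dLayer(2, "Stride", 2)
        convolution2dLayer(3, 32, "Padding", "same")
        batchNormalizationLayer
        reluLayer
        fullyConnectedLayer(2)     % two classes: normal vs. pneumonia
        softmaxLayer
        classificationLayer];
    analyzeNetwork(layers)         % reports activations, learnables, warnings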
In this step we also have to account for what the training environment will be. This may involve using the CPU, a computer with a GPU or multiple GPUs, maybe a cluster made of multiple CPUs and GPUs, and possibly also leveraging the cloud. However, coming up with the optimal hyperparameters might not be so straightforward. Experiment Manager is an app that allows you to find optimal training options for neural networks by sweeping through a range of hyperparameter values or using Bayesian optimization.
The app allows us to run different training configurations, even in parallel if we have access to the necessary hardware. Moving on to model training. After obtaining a good model from the Experiment Manager app, we'll be iterating over some of the steps in the W-shaped cycle. As we'll see in a bit, in order to comply with some of the requirements related to the robustness of the deep learning model, we'll be retraining the model using data augmentation techniques. That is, performing meaningful transformations to the data set in order to improve model performance and robustness.
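A sketch of what that augmented retraining can look like; the augmenter settings and variable names such as imdsValidation are illustrative, not the exact configuration from the demo.

    % Meaningful transformations: reflections, small rotations, translations
    augmenter = imageDataAugmenter("RandXReflection", true, ...
        "RandRotation", [-10 10], "RandXTranslation", [-2 2]);
    augimds = augmentedImageDatastore([28 28], imds, ...
        "DataAugmentation", augmenter);
    options = trainingOptions("adam", "MaxEpochs", 20, ...
        "ValidationData", imdsValidation, "Plots", "training-progress");
    net = trainNetwork(augimds, layers, options);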
Now, after training the model, we know that although we might have improved some of the results with this process, we still need to further improve the robustness results. So next we'll use a training algorithm called FGSM, or fast gradient sign method, for adversarial training. The goal is to generate adversarial examples during training that are visually similar to the original input data but can cause the model to make incorrect predictions.
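In essence, FGSM perturbs each input along the sign of the loss gradient with respect to that input. A minimal sketch of the idea, assuming net is a dlnetwork, X is a formatted dlarray batch of images, and T holds one-hot targets; epsilon and the helper function are illustrative, not the exact training code.

    % Generate one batch of FGSM adversarial examples
    epsilon = 0.01;                                % perturbation size
    gradient = dlfeval(@inputGradient, net, X, T);
    XAdv = X + epsilon * sign(gradient);           % adversarial inputs
    XAdv = min(max(XAdv, 0), 1);                   % keep pixels in [0,1]

    function gradient = inputGradient(net, X, T)
        Y = forward(net, X);                       % network predictions
        loss = crossentropy(Y, T);
        gradient = dlgradient(loss, X);            % d(loss)/d(input)
    end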
Let's move on to the learning process verification. Now this is a key step in the workflow, so I'll be spending some time here to discuss things you can do to build a good understanding of the learning process and learn about new functionality we've introduced for AI verification. Naturally, after you've trained the model-- in this case, I'll be showing the results after performing adversarial training-- you'd like to evaluate the model accuracy with an independent test set.
And in this case, the model provides more than 90% accuracy, surpassing the results reported in the original paper for similar networks. Aside from looking at global metrics, you'd probably also want to look at the confusion matrix and get insights into the sources of errors made by the model. Additionally, given an image, you may want to use techniques such as Grad-CAM to gain a better understanding of how the model is making its predictions.
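A sketch of that evaluation, together with the Grad-CAM visualization discussed next; imdsTest stands for the held-out test datastore.

    % Test-set accuracy and confusion chart
    YPred = classify(net, imdsTest);
    accuracy = mean(YPred == imdsTest.Labels)
    confusionchart(imdsTest.Labels, YPred)

    % Grad-CAM map for one sample image
    X = readimage(imdsTest, 1);
    scoreMap = gradCAM(net, X, classify(net, X));
    imshow(X, []); hold on
    imagesc(scoreMap, "AlphaData", 0.5); colormap jet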
The purpose of Grad-CAM is to highlight the regions of the input image that contributed the most to the final prediction.

Robustness is one of the main concerns when deploying neural networks in safety-critical situations. The reason is that neural networks have been shown to misclassify inputs due to small, imperceptible changes-- small perturbations that change the output of the network. We will not have a trustworthy model if changing a single pixel gives us a different output, right?
So in R2022b, we released Deep Learning Toolbox Verification Library, allowing us to verify and test the robustness of deep learning networks. Let's see what this library helps you accomplish in terms of verification. Given one of the images in our test set, we can choose a perturbation that defines, let's say, a collection of perturbed images for this specific image. It's important to note that this collection of images is extremely large. This is just a representative sample, and it's not practical to test each perturbed image individually.
Using formal verification methods, we can prove, for the entire volume, whether the output of the network changes. If the output of the network doesn't change, we get a verified result. If the output does change, then the property is violated. And if we're not able to prove it either way, then we get an unproven result. Let's see this with some code.
We'll start with the perturbation. The image values in XTest range from zero to one, so this is a 1% perturbation up or down. By using XLower and XUpper, we are defining a collection of images-- the volume we were seeing earlier-- where XLower and XUpper simply set the bounds of the perturbation. This means that we will test all possible perturbed images that fall within these bounds.
Before running the verification technique, we need to convert the data to a dlarray with its corresponding data format. Note that XTest is not just a single image but a batch of images to verify, so we have a volume to verify for each of the images in the test set. We can then run verifyNetworkRobustness, providing the trained network, the lower and upper bounds, and the ground truth labels for the images.
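Putting those steps together, the code looks roughly like this; XTest and TTest stand for the test images and their labels, and the network is assumed to be a dlnetwork.

    perturbation = 0.01;                    % pixel values lie in [0,1]
    XLower = XTest - perturbation;          % lower bound of the volume
    XUpper = XTest + perturbation;          % upper bound of the volume
    XLower = dlarray(XLower, "SSCB");       % spatial, spatial, channel, batch
    XUpper = dlarray(XUpper, "SSCB");
    result = verifyNetworkRobustness(net, XLower, XUpper, TTest);
    summary(result)                         % verified / violated / unproven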
The result is that we get over 400 images verified, 13 where there was a violation, and over 200 with an unproven result. Now we have to go back to the images where we got violations or unproven results and see if there's anything we can learn from this process. But for over 400 images, we were able to formally prove that there is no surrounding adversarial example with up to 1% perturbation that changes the output of the network, and that's quite a strong statement.
If we had used the original network instead, we would have gotten an unproven result for pretty much all our images. And in a safety-critical context, you'll likely need to treat the unproven results as violations. Using data augmentation techniques helped significantly to get some images verified, but only after training the network with adversarial training were we able to come up with a robust network for which we could verify many more images.

A trustworthy AI system should produce accurate predictions in known contexts, but it should also identify examples unknown to the model and reject them or defer them to a human for safe handling.
Deep Learning Toolbox Verification Library also includes functionality for out-of-distribution detection. So here we have a sample image from our test set. From this test set I'm going to, let's say, define some new test sets based on some meaningful transformations. We can add speckle noise and create a new test set-- this is what it looks like for the sample image. Or I can also flip the images left to right or increase the contrast.
Now using this library, you can create a distribution discriminator with which you can assign confidence to the network predictions by computing a distribution confidence score for each observation. It also provides a threshold for separating the in-distribution from the out-of-distribution data. Here we're seeing the network distribution scores for the training data in blue, which is the in-distribution data set. And for the transformations we made to the test set, we can also see their distributions.
Now with the distribution discriminator and the obtained threshold, we can tell that if we were to receive the speckle-noise image from the bottom left at test time, the distribution discriminator would actually consider the image to be in distribution, and so we could trust the output given by the network for this image. On the contrary, for the other two images, the distribution discriminator considers them out of distribution, and so we shouldn't trust the output that the network may provide in those cases.
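A sketch of that workflow with the library's discriminator functions; the "energy" method and variable names such as XTrain and XSpeckle are assumptions for illustration.

    % Fit a discriminator on in-distribution training data
    discriminator = networkDistributionDiscriminator(net, XTrain, [], "energy");
    threshold = discriminator.Threshold      % separates ID from OOD scores
    scores = distributionScores(discriminator, XSpeckle);
    tf = isInNetworkDistribution(discriminator, XSpeckle)  % true => trust it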
MathWorks has a unique code generation framework that allows AI models to be deployed anywhere without having to rewrite the model in another language, eliminating coding errors. Using analyzeNetworkForCodegen, you can check the code generation support for the trained network. And as you can see, the trained network is supported across multiple CPU and GPU library targets, and also for generating code that does not use any third-party library, which is what the 'none' value in the table refers to.
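That check is a one-liner:

    % Returns support across targets; 'none' means no third-party library
    targets = analyzeNetworkForCodegen(net)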
Next we'd like to move to the next step, where we'll be integrating the AI model into the larger system under design. Here we show how to integrate the deep learning model into Simulink by using the predict block. And even though this image shows a simple test harness, in this phase we would also be integrating the model with other techniques-- for instance, tracking algorithms or a runtime monitoring system such as the out-of-distribution detection we saw earlier.
The next step is meant to close the data management lifecycle, ensuring the operating space has been correctly identified and that the differences between the learning and inference environments have been mitigated. And finally, machine learning requirements verification. Here we want to go through the process of performing thorough testing of the deep learning model. We can use tools such as Simulink Test or MATLAB Test, as we're showing here, and then we can show that the requirements have been not only implemented but also verified. For instance, for the requirement highlighted in blue, we can link to where the requirement is implemented and to the test that verifies it.
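As an illustration, a class-based test for the accuracy requirement could look like the following; the file name, helper function, and 90% threshold are hypothetical.

    classdef tAccuracyRequirement < matlab.unittest.TestCase
        methods (Test)
            function accuracyAboveThreshold(testCase)
                data = load("trainedPneumoniaNet.mat");  % assumed artifact
                imdsTest = buildTestDatastore();         % hypothetical helper
                YPred = classify(data.net, imdsTest);
                accuracy = mean(YPred == imdsTest.Labels);
                testCase.verifyGreaterThan(accuracy, 0.90);
            end
        end
    end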
To wrap up this session today: we have seen the W-shaped development cycle, which is an adaptation of the classic V-shaped development process to AI applications, and how MathWorks has capabilities addressing each area of the diagram. We've introduced Deep Learning Toolbox Verification Library, a new library to verify and test the robustness of deep learning networks.
And a significant number of customers have been employing our products to certify conventional software according to different industry standards such as ISO 26262, IEC 61508, and DO-178C. Furthermore, MathWorks takes an active role in supporting initiatives aimed at certifying AI-based software, including the EUROCAE and SAE working group on AI in aviation. That's all from my side. Thank you all for your attention.