Practical Reinforcement Learning for Controls: Design, Test, and Deployment - MATLAB

    Overview

    Reinforcement learning has been gaining attention as a new control design method that can automatically learn complex control and realize high performance. However, reinforcement learning policies often use deep neural networks, which makes it difficult to guarantee the stability of the system with conventional control theory.

    In this session, we will introduce ideas on how to use reinforcement learning for practical control design with MATLAB and Reinforcement Learning Toolbox. We will cover some of the latest features available in the tool and we will also introduce a complete workflow for the design, code generation, and deployment of the reinforcement learning controller.

    About the Presenters

    Emmanouil Tzorakoleftherakis is a senior product manager at MathWorks, with a focus on reinforcement learning, deep learning, and control systems. Emmanouil has an M.S. and a Ph.D. in Mechanical Engineering from Northwestern University, and a B.S. in Electrical and Computer Engineering from the University of Patras in Greece.

    Naren Srivaths Raman is a senior application engineer at MathWorks, with a focus on reinforcement learning and model predictive control. Naren has an M.S. and a Ph.D. in Mechanical Engineering from the University of Florida, and a B.E. in Mechanical Engineering from Anna University in India. 

    Recorded: 29 Jun 2022

    Hello, everyone. Welcome to this webinar on practical reinforcement learning for controls. My name is Emmanouil, and I am a product manager at MathWorks focusing on the reinforcement learning and controls area. I've been with MathWorks for about five years now, and I work closely with the development team to help guide the roadmap of our tools while also relaying customer feedback back to them.

    Today, I have with me Naren. Naren do you want to introduce yourself?

    Sure. Hi all, my name is Naren, and I'm an application engineer at MathWorks focusing on reinforcement learning and model predictive control. I've been with MathWorks for about a year, and I help customers using RL and MPC be successful with our tools. I also work closely with the development team.

    Awesome, thanks Naren. So today, we will be covering the use of reinforcement learning for control applications, and in particular, we're going to talk about how we can design, test, and deploy reinforcement learning systems in practice with MATLAB, Reinforcement Learning Toolbox, and other MathWorks tools. With that, let's start with a short refresher on what reinforcement learning is.

    So reinforcement learning is a type of machine learning that trains something called an agent through repeated interactions with an environment. If you think of machine learning in general, reinforcement learning is pretty much a subcategory of machine learning, along with unsupervised and supervised learning, with the key difference being that training data is generated as we go, through interactions with an environment. This is in contrast with supervised and unsupervised learning, where you typically have a data set at your disposal before training.

    So Emmanouil, there's a lot of confusion around deep learning and reinforcement learning. So can you clarify where deep learning fits into this picture?

    Sure thing. So deep learning actually spans all three classes of machine learning I'm showing here. It's also important to remember that deep learning and reinforcement learning, they are not mutually exclusive, and in fact, for complex applications, you might want to use deep neural networks to solve your problem, which is something that's typically referred to as deep reinforcement learning.

    All right, let's look at some common RL concepts through an engineering example, specifically a self-driving car. In this scenario, the vehicle's computer would be the equivalent of what we call an agent in reinforcement learning. Now, the agent is reading measurements from sensors. You could have LIDAR sensors, you could have a camera, and so on. These sensors would represent the state of the environment.

    So again, in this scenario, that would be the position of the vehicle, if there are any other vehicles around, the road conditions, and so on. Now, based on these observations, the agent generates an action using its current policy. A policy is pretty much similar to a controller here. And the actions of that policy could be things like steering the wheel or braking or whatever makes sense for the application.

    Now, after taking an action, the agent will receive a reward back from the environment, and that reward will specify how good or how bad that action was. The reward in this case could be something related, for instance, to fuel efficiency or driver comfort or whatever makes sense, again. And the purpose of training is to find a policy that collects as much reward as possible.

    So at the end of the day, the training algorithm will modify the parameters of the policy using the reward information it's getting, and the cycle of taking actions, collecting rewards, and updating the policy continues until the agent learns which action provides the highest long-term reward at each state. Now, you may have noticed that reinforcement learning so far looks a lot like a control design method, and in fact, it does have many parallels to control design.

    So the policy in a reinforcement learning system can be viewed as the equivalent of a traditional controller. The environment would be the equivalent of the plant. Same goes for observations and measurements, actions and manipulated variables. The reward in reinforcement learning is similar to a cost function in optimal control, let's say, or the equivalent of the error from some desired control objective.

    And then finally, the training algorithm itself could be thought of as similar to an adaptation mechanism that changes the weights of a traditional controller. So that was a short refresher on how RL works. If you're interested in learning more about it, we have free resources available on our website like Tech Talks with Brian Douglas. We have ebooks, demo videos, webinars.

    You can take your time and check all these resources out.

    So Emmanouil, that was a good refresher here, but how do you actually set up and solve a reinforcement learning problem?

    So the good news is that, regardless of the type of problem you're trying to solve, you can pretty much follow the exact same workflow. And the first thing you want to do is decide on how you want to represent your environment. Again, the environment, as we explained previously, is pretty much everything outside the agent, and it could be either a real physical system or a virtual model. Now, obviously, as you can imagine, a technique like reinforcement learning, which relies on trial and error, can be risky to apply to a real physical system, so simulation models are a safe starting point.

    And once you have a good enough policy trained against a simulated model, you could always fine-tune it against the real one. The next step in the workflow is coming up with a reward signal. As I explained earlier, the reward is basically just a number that tells the agent how good or how bad an action is. One thing to keep in mind here is that coming up with a good reward function can be quite challenging, and it may take a few iterations to get it right.

    The next step is to create the agent, and this step includes various things like deciding on which training algorithm you want to use and, at the same time, coming up with a policy representation. So for example, if we're using neural networks, we need to decide how many layers we want to use, what type of layers, how many hidden nodes, and so on. And then with these steps out of the way, we can start the training process.

    Now, you may be aware that reinforcement learning is a pretty expensive technique computationally. So most of the time, you're going to need a large number of training episodes or simulations, depending on what you use, to even get a decent policy. The good news is that this is where things like parallel computing and GPUs come into the picture, and they can help you accelerate the training process. But even if you do use those parallelization techniques, training could still take anywhere from minutes to hours or even days to complete, depending on the problem you're working on.

    So now, after training converges, we need to make sure that the trained policy performs as expected. We're going to talk about different ways of doing that in the example later on. But things like running multiple simulations and software-in-the-loop and processor-in-the-loop tests are certainly relevant here. And then finally, once we are happy with the trained policy, we can take it and deploy it to the target hardware.

    Emmanouil, so one thing I want to say here is that oftentimes the policy might not perform as expected during testing. And in that case, you will need to go back and make changes to the problem formulation and then train again. You might have to change let's say the structure of the neural network, maybe try a different reward signal, and so on. And then train the agent again until you're satisfied with the result.

    All right, so now we know what we need to do to set up an RL problem, but Emmanouil, how do we actually solve it?

    Great question. So the good news is that we have a dedicated toolbox for reinforcement learning, which allows you to go through all those steps of the workflow we just went through using MATLAB and Simulink. So Reinforcement Learning Toolbox includes popular training algorithms like DQN, DDPG, Soft Actor-Critic, and PPO out of the box, but you can also create your own custom algorithm if you wanted to.

    You can train against environment models created in MATLAB and Simulink. You can use layers from Deep Learning Toolbox to represent neural network policies. You can parallelize training. We also have an app called Reinforcement Learning Designer, which lets you do pretty much everything I mentioned above interactively through a GUI.

    You can deploy trained policies with our various coder tools, and finally, the toolbox includes reference examples that can help you get started. All right, with that, let's now take a look at how we can design a reinforcement learning system for control applications. Naren, over to you.
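    As a minimal sketch, the basic programmatic workflow with the toolbox looks roughly like the following. It uses one of the predefined environments that ships with the toolbox, and the settings are illustrative only, not the pendulum example that follows.

        % Minimal sketch of the Reinforcement Learning Toolbox workflow
        % (predefined environment and illustrative settings, not the pendulum demo).
        env = rlPredefinedEnv("CartPole-Discrete");   % 1. environment
        obsInfo = getObservationInfo(env);
        actInfo = getActionInfo(env);

        agent = rlDQNAgent(obsInfo, actInfo);         % 2. agent with default networks

        trainOpts = rlTrainingOptions( ...            % 3. training options
            MaxEpisodes=500, ...
            StopTrainingCriteria="AverageReward", ...
            StopTrainingValue=480);
        trainingStats = train(agent, env, trainOpts); % 4. train the agent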

    Thank you, Emmanouil. So there are many ways to design a reinforcement learning system for controls, and one potential control architecture could look something like this, where we have our plant and a reinforcement learning agent, which acts as an end-to-end decision-making system. But there could be a few challenges associated with such an approach.

    The first thing to note is that for complex control problems, the most popular choice to represent the policy is neural networks, as they are great function approximators and allow representation of complex policies. But the problem is that neural networks are effectively black boxes, and we know there are no formal methods to verify and guarantee the performance of a neural network. Moreover, learning such an end-to-end policy can be hard, as the agent needs to learn different aspects such as perception, planning, and control.

    All right, so then what are our options here? One approach is to divide the job of the RL agent into smaller elements such as perception, planning or scheduling, and control. So then the control architecture could look something like this, where the RL agent is responsible only for planning or scheduling, while traditional methods like PID, LQR, or MPC are used for low-level control.

    And the perception element can be handled using other methods like Kalman filters, machine learning, et cetera. The advantage of such a hierarchical approach is that it is easier to train, as RL is responsible only for planning or scheduling. Moreover, we can avoid using RL for critical low-level control and instead use traditional verifiable methods for it. All right, that was one approach.

    Another design could be something like this, where we use a verifiable or legacy controller as a backup. One reason to do this is that we can switch to the backup as needed, which could be because of, let's say, unexpected behavior from the RL controller, or because we are performing a safety-critical task and want to use a verifiable method for it. A third design is to use a post-processing module, which modifies the manipulated variables sent by the controller.

    As an example, you could use the Constraint Enforcement block from Simulink Control Design, which is shown here. It solves an optimization problem to enforce certain action constraints or behaviors, and this architecture allows you to enforce certain behaviors without having to retrain the agent. Another approach is to use RL together with traditional controls, as shown here.

    Traditional controllers like PID, LQR, or model predictive control can act as the main verifiable controller, while RL fine-tunes the main controller's output. So one advantage of such an approach is that the RL agent could use complex observations like images, while the traditional controller uses signal measurements.

    Naren, let me clarify one thing here. What we've talked about so far is by no means an exhaustive list of architectures. There could obviously be others that fit the bill, and you could also mix and match the designs we talked about as needed for your specific application. And you will actually see an instance of that in the example we will show in a bit.

    That's a great point, Emmanouil. All right, we'll now go through an example to show how you would use Reinforcement Learning Toolbox and other MathWorks tools to design, test, and deploy an RL system. We'll use the rotational inverted pendulum by Quanser as our target platform, which is what is shown here. The control objective is to swing up and balance the pendulum, as shown in these videos.

    So for this example, we'll be using a control architecture which is somewhat similar to the hierarchical approach we showed earlier. It consists of an inner loop controller, which uses PID control, and two RL agents in the outer loop. One is a swing up planner, which sends reference signals to the inner loop controller, and the other is a mode select scheduler, which switches between the reference signal of pi and the signal from the swing up agent.

    The idea behind this architecture is to use the swing up agent to bring the pendulum close to the upper equilibrium and then have the mode select agent switch the reference signal to pi, where it is easier for the inner loop controller to perform the final stabilization of the system. All right, now we'll be going through the different steps of the RL workflow for this example. The first thing we need to do is create the environment.

    We'll start with the simulated environment for all the reasons that Emmanouil mentioned earlier, such as safety, ease of training, and so on. Reinforcement Learning Toolbox lets you train against models created in Simulink or MATLAB, so let's see how that works in our example. Our model consists of three parts: outer loop control, inner loop control, and the environment model.

    The environment model consists of two subsystems: the DC motor model and the physical model of the pendulum. We use blocks from Simscape Electrical, Simscape Multibody, and Simulink to model the environment. Next, we'll design the rewards for our agents. The reward for the mode select agent consists of two parts.

    The first part is to keep the motor angle close to 0 degrees, and we provide an additional sparse reward if the pendulum is close to the upward position. The reward for the swing up agent consists of four parts. The first part is to keep the motor angle close to 0 degrees, the second is to lift the pendulum to around 180 degrees, the third is to keep the velocity of the pendulum small, and we provide an additional sparse reward if the pendulum is close to the upward position and the motor angle is not larger than 90 degrees.
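    To make the structure of these rewards a bit more concrete, here is a hypothetical MATLAB sketch of a shaped-plus-sparse reward along the lines described above. The weights and thresholds are made up for illustration; the actual reward in this example is implemented inside the Simulink model.

        function r = swingUpReward(motorAngle, pendAngle, pendVelocity)
        % Hypothetical sketch of the swing-up reward described above.
        % Weights and thresholds are illustrative, not the values used in the demo.
        r = -0.1*motorAngle^2 ...                    % keep the motor angle close to 0
            - 1.0*(abs(pendAngle) - pi)^2 ...        % lift the pendulum toward 180 degrees
            - 0.01*pendVelocity^2;                   % keep the pendulum velocity small
        % Additional sparse bonus near the upright position with a bounded motor angle
        if abs(abs(pendAngle) - pi) < 0.1 && abs(motorAngle) < pi/2
            r = r + 10;
        end
        end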

    Naren, let me add one thing here. So in this example, we came up with the reward signals ourselves. However, if you have, let's say, an existing controller in place, like a model predictive controller for instance, Reinforcement Learning Toolbox lets you automatically generate a reward function using the cost and constraint specifications of your existing MPC controller. You could also generate a reward signal from performance constraints defined in model verification blocks from Simulink Design Optimization. And you can use this automatically generated reward function as a starting point: you can start there and make changes as needed until your training converges.
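    For reference, here is a hedged sketch of what that looks like in code, assuming you already have an MPC controller designed with Model Predictive Control Toolbox. The plant, sample time, and block path below are placeholders.

        % Hedged sketch: generate a reward function from an existing MPC design.
        plant = tf(1, [1 2 1]);        % placeholder plant model
        mpcobj = mpc(plant, 0.1);      % existing MPC controller (assumed design)
        generateRewardFunction(mpcobj) % writes a reward function file you can then edit

        % Alternatively, generate a reward from model verification blocks
        % (block path is a placeholder):
        % generateRewardFunction("myModel/Check Step Response Characteristics")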

    Thanks, Emmanouil. So back to our example: we have now created the environment, so let's see how to import it into the Reinforcement Learning Designer app. Note that since we have two agents that we want to train, we have two environments that we need to import. You can open the app from the command line or from the MATLAB toolstrip. I'm opening it here from the MATLAB toolstrip.

    All right, so this is the Reinforcement Learning Designer app, and the app lets you import environment objects from the MATLAB workspace, or you can select from several predefined environments. We have several of them for both MATLAB and Simulink environments. For this example, we'll be importing the custom Simulink environments we created for training the swing up and mode select agents. You can delete or rename the environment objects from the Environments pane as needed, and you can view the dimensions of the observation and action space in the Preview pane.
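    If you prefer working from the command line instead of the app, a Simulink environment can also be created programmatically. The sketch below assumes hypothetical model and block names, as well as observation and action dimensions, just to show the shape of the code.

        % Hedged sketch: creating a Simulink environment programmatically
        % (model name, block path, and signal dimensions are assumptions).
        obsInfo = rlNumericSpec([5 1]);                                % observation spec
        actInfo = rlNumericSpec([1 1], LowerLimit=-1, UpperLimit=1);   % continuous action spec
        mdl = "rlQubeServoModel";                                      % assumed model name
        agentBlk = mdl + "/Outer Loop Control/Swing-up Agent";         % assumed RL Agent block path
        env = rlSimulinkEnv(mdl, agentBlk, obsInfo, actInfo);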

    The next step is to consider how to represent the policy for the agent. We'll only use neural networks in this example, and we have several options we can use to create the networks. One option is to use Deep Learning Toolbox layers to construct the network programmatically, as shown in the top right. The other option, which I really like, is that Reinforcement Learning Toolbox agents provide default neural network policies appropriate to the algorithm being used.

    All it requires is details regarding the observation and action spaces as shown in the bottom right here. This gives you a good starting point to iterate upon. In the upcoming slides, we'll be using this option for creating the agents using the Reinforcement Learning Designer app. Another option would be to use the Deep Network Designer app to implement the policy network interactively.

    Or you could also interoperate with third-party frameworks such as TensorFlow, Keras, and ONNX. All right, so now we'll be looking at how we created the policy and the agent for our pendulum example in the app. You can create a new agent from the Agent section of the Reinforcement Learning tab. And depending on the selected environment and the nature of the observation and action spaces, the app will show a list of compatible built-in training algorithms.

    So here, you can see that TD3, DDPG, PPO, SAC, and TRPO may work with the selected environment because it involves a continuous action space. For this demo, we'll pick SAC, which stands for soft actor-critic. The app will automatically generate a SAC agent using the default agent feature we talked about previously.

    You can adjust some of the default values for the actor and critic as needed before creating the agent. So here, I'm actually going to change the number of hidden units to 300 from the default value of 256. All right, so the newly created agent will appear in the Agents pane, and the agent editor will show a summary view of the agent and all available parameters that can be tuned.

    So for example, let's actually change the agent's sample time, and we'll also change the actor's learning rate here. All right, now that we're done with that, to view the actor's default network, you can click on View Actor Model on the SAC Agent tab. The Deep Learning Network Analyzer opens and displays the actor structure.

    So you can see that the initialization option we chose of using 300 hidden units is reflected. Similarly, you can also view the critic default network. You can change these neural networks by importing a different actor or critic network from the workspace. You can also import a different set of agent options or a different actor or critic representation object altogether.
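    The same agent creation and tweaks can also be done programmatically. The following is a hedged sketch using the default-network feature; the sample time and learning rate values are assumptions, shown only to mirror the kind of edits made in the app.

        % Hedged sketch: default SAC agent with 300 hidden units per layer,
        % then adjusting a couple of options (values are illustrative).
        initOpts = rlAgentInitializationOptions(NumHiddenUnit=300);
        agent = rlSACAgent(obsInfo, actInfo, initOpts);            % default actor/critic networks
        agent.AgentOptions.SampleTime = 0.05;                      % assumed sample time
        agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-4; % assumed actor learning rate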

    All right, now back to our Simulink model. Note that we have two agents for this example, the swing up agent and the mode select agent. So let's take a look at the swing up subsystem here. In Simulink, you can use the RL Agent block from Reinforcement Learning Toolbox to link to an agent object created in the app, which is what is shown here.

    Let's also take a quick look at the mode select subsystem. In here, you can see that the RL Agent block is used as well, and we can basically follow the same procedure that I just showed to create the mode select agent. All right, the next step is to train the agents. First, click Train to specify various training options, such as stopping criteria for the agent, and so on.

    So here, let's set the max number of episodes to 1,000, the average window length to 50, and the stopping criteria to average reward with a value of 7,500. We leave the rest as default values. You can parallelize training if you want, and the parallelization options include additional settings, such as whether the data will be sent synchronously or not, and more.

    After setting the training options, you can generate a MATLAB script with the specified settings that you can use outside the app, or start training right here. If visualization of the environment is available, you can view how the environment responds during training. During the training process, the app opens the Training Session tab and displays the training progress. You can stop training any time you want, and once you stop training, you can choose to accept or discard the training results.

    If you do choose to accept it, then a new trained agent will appear under the Agents pane, and the accepted results will show up under the Results pane.
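    Outside the app, the same training setup can be expressed with a few lines of MATLAB. The values below mirror the settings mentioned above, and the parallel option is shown only for completeness.

        % Hedged sketch of the training setup shown in the app.
        trainOpts = rlTrainingOptions( ...
            MaxEpisodes=1000, ...
            ScoreAveragingWindowLength=50, ...
            StopTrainingCriteria="AverageReward", ...
            StopTrainingValue=7500, ...
            UseParallel=false);          % set to true to parallelize rollouts
        trainingStats = train(agent, env, trainOpts);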

    Naren, one thing here. So you showed how we can train the swing up agent, but how about the mode select agent?

    Yeah, so that's actually a great question. So the way we do it is, we first train the swing up agent and then we freeze the policy. Then we use that to train the mode select agent using the exact same procedure that I showed here. So basically, we train these agents in series.

    All right, so the plots here show the final training results for the swing up agent and the mode select agent. Training took a few hours to complete for both agents without using any parallelization. You can monitor the progress of the training process with the Episode Manager in the app, which provides useful information like the episode reward values, number of episodes, and so on.

    All right, thanks, Naren. So now, we have trained our agents. But before we actually go and deploy them, we need to make sure they perform as expected. The first thing we can do pretty easily is test the trained policies in simulation using our Simulink model. As you can see in this video, the swing up agent successfully brings the pendulum up close to the upward equilibrium, and the mode select agent successfully switches between swing up and the fixed reference, which is what we wanted to see.
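    Programmatically, a quick way to run such verification simulations against an environment object is with rlSimulationOptions and sim, shown here generically for a single agent; the episode length and number of runs are illustrative.

        % Hedged sketch: simulate the trained policy a few times to check behavior.
        simOpts = rlSimulationOptions(MaxSteps=2000, NumSimulations=5);
        experiences = sim(env, agent, simOpts);   % logged observations, actions, and rewards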

    The next thing we may want to try is to test the performance of the generated code. So first, we're going to run a so-called software-in-the-loop test. A software-in-the-loop test, or SIL test, will generate C code from the controller, which includes the RL policies, and then execute this code as a separate process on your host computer. Next, we're going to run a processor-in-the-loop test, or PIL test, in which case we will also generate code from the controller, but the code in this case will run on the target platform we want to test against.

    And in this case, it's going to be a Raspberry Pi. Now, in our hardware setup, the Raspberry Pi directly communicates with the Quanser QUBE. In the case of PIL testing, however, the Raspberry Pi will actually calculate the controller output and then send it over to Simulink to control the simulated plant.

    So Emmanouil, what are the benefits of actually doing these tests, though?

    Good question. So ultimately, we want to make sure that when we generate code from the controller, the code performance will still be acceptable. So by comparing the results from the normal simulation with the results from the SIL and PIL tests, you can check the numerical equivalence of your model and the generated code, plus you can also make sure that you meet the real-time constraints if you have any.

    All right, so first, let's see how we can generate code from a trained policy in Simulink. There are really only two steps involved here. The first thing we need to do is extract the trained policy from the RL agent, which we can do by calling the generatePolicyFunction method. This method creates a MATLAB function that takes in observations as input and outputs actions, and then we simply need to call this function from a MATLAB Function block in our Simulink model, and that would be the only change we would need to make on the reinforcement learning side.
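    In code, those two steps look roughly like this; generatePolicyFunction writes an evaluatePolicy.m file along with a data file containing the trained policy parameters.

        % Step 1: extract the trained policy from the agent.
        generatePolicyFunction(agent);   % creates evaluatePolicy.m and agentData.mat

        % Step 2: call the generated function from a MATLAB Function block
        % in the Simulink controller. The block body is then simply (sketch):
        %   function action = computeAction(observation)
        %       action = evaluatePolicy(observation);
        %   end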

    Now, to run a SIL test, we will first open the controller subsystem, and in the model settings, we need to make sure that we use a fixed-step solver. Once we do that, we also need to make sure that we select the simulation target language to be C++, along with MKL-DNN as our target library. And then finally, under the Code Generation tab, we can select the Simulink Coder or Embedded Coder system target file, as well as the target language, which is C in this case.
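    As a side note, and only as a hedged sketch, the extracted policy function can also be compiled directly from the MATLAB command line with MATLAB Coder; the observation size and library choices below are assumptions and differ slightly from the Simulink-based configuration used in this example.

        % Hedged sketch: command-line code generation for the extracted policy.
        cfg = coder.config('lib');                                   % static library target
        cfg.TargetLang = 'C++';                                      % MKL-DNN generates C++ code
        cfg.DeepLearningConfig = coder.DeepLearningConfig('mkldnn'); % deep learning target library
        codegen -config cfg evaluatePolicy -args {ones(5,1)} -report % assumed 5x1 observation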

    Next, at the top level of the model, we can set the simulation mode for the controller subsystem to software-in-the-loop. And finally, we can run the SIL test from the SIL/PIL tab. The process to set up the PIL test is actually very similar to this one, so in the interest of time, we will skip that. Now, here are the results of the SIL test.

    As you can see, the results of the model and the generated code are numerically equivalent, which is what we wanted to see. And here are the results of the processor-in-the-loop test, the PIL test. Again, as you can see at the top left, the code generated from the controller is still successfully controlling the system, and in this case, we also did some profiling on the Raspberry Pi. We found that running inference on the trained policies takes approximately 2.68 milliseconds on average, and the respective CPU utilization is pretty low at approximately 13%, which, again, is great.

    So now that we are happy with our test results, we can go ahead and deploy the controller to our hardware system, and as you can see in this video, the controller does a pretty good job of swinging up and balancing the pendulum.

    That actually looks pretty great. So one question, what if the deployed policy does not perform as expected?

    Great question. So even for this example, the trained policy does not always perform great, and this is actually not surprising, as there are obviously model discrepancies between the simulated environment that we trained against and the actual dynamics of the hardware. There could also be sensor noise involved or other environment circumstances that we did not account for during training, and so on. So for these reasons, it's possible that you may need to do some additional training against the real hardware system.

    So the idea here is you can start by training with a simulated model to get an initial policy in place, and then you can fine-tune this policy by training it against the real physical system.

    OK, but is there a way we can limit the additional training needed?

    Sure. I mean, the more parameters you randomize during training, the more robust the trained policy is going to be, although that will obviously prolong the training time as well, right? So you can randomize things like initial conditions, parameters of the dynamics, and environment conditions, and this is actually a good practice in general to follow as you're setting up your problem.
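    As a hedged illustration, for a Simulink environment this kind of randomization is typically done in the environment's reset function; the variable names and ranges below are assumptions.

        % Hedged sketch: randomize conditions at the start of every training episode.
        env.ResetFcn = @localResetFcn;

        function in = localResetFcn(in)
        % Variable names and ranges are assumptions for illustration.
        in = setVariable(in, "theta0", 0.1*randn);                  % random initial pendulum angle (rad)
        in = setVariable(in, "armInertia", 6e-5*(1 + 0.05*randn));  % perturbed dynamics parameter
        end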

    OK, so that was the end of the example. Hopefully, you got an idea of how to design, test, and deploy reinforcement learning for controls using our tools. One thing I want to mention here is that since the introduction of Reinforcement Learning Toolbox, we have engaged with multiple customers who have been using our reinforcement learning tools. For example, Vitesco Technologies applied reinforcement learning to speed up the development of powertrain controllers, Cummins used RL to enhance and augment classical controllers for automotive systems, and Lockheed Martin applied RL to expose vulnerabilities in 5G systems.

    So as you can see from this last example, reinforcement learning is also used in areas outside of controls. Specifically, RL can be used in a variety of decision-making problems, such as scheduling, resource allocation, calibration, and so on, and in many application areas like robotics, automated driving, wireless systems, cybersecurity, and the list goes on. Reinforcement Learning Toolbox provides several reference examples for a lot of these areas to help you get started quickly.

    Last but not least, here are some additional resources. We have a free Reinforcement Learning Onramp that is self-paced and can teach you basic concepts as well as how to use Reinforcement Learning Toolbox. We're also providing links here to the examples that we showed today in case you want to take a closer look, and that concludes the webinar.

    This is all we had. Thank you for your attention. We can now move to the Q&A portion of the webinar. If you have questions, please type them in the Q&A panel. We'll take a few moments to review them, and we will then come back online for answers. Thank you very much.
