An Introduction to Multi-Agent Reinforcement Learning
From the series: Reinforcement Learning
Learn what multi-agent reinforcement learning is and some of the challenges it faces and overcomes.
You will also learn what an agent is and how multi-agent systems can be cooperative, adversarial, or a mix of both. A grid world example then highlights some of the benefits of both decentralized and centralized reinforcement learning architectures.
Published: 13 Jul 2022
In this video, I want to briefly introduce you to multi-agent reinforcement learning and, hopefully, provide you with some intuition into what it is and some of the challenges it faces and overcomes. So I hope you stick around for it. I'm Brian, and welcome to a MATLAB Tech Talk.
All right. Let's start with what a multi-agent system is. An agent is something that is autonomous in the sense that it can observe the environment and then, on its own, choose how to act based on those observations. An agent could be something that physically moves, like a robot or a vehicle, or it could be something that doesn't, like a controller in a larger software or logistical system. As long as the entity has the ability to autonomously react to observations, and its actions change the state of the environment, then it's an agent.
Now, a multi-agent system is one in which there are multiple of these autonomous entities sharing a common environment. Think of a swarm of robots that operate in a formation, or several autonomous vehicles driving through the same intersection, or a network of distributed controllers that are trying to accomplish a larger goal-- for example, several smart homes that are each trying to schedule power-consuming activities, like car charging, in a way that balances the power load throughout the day rather than concentrating it at peak times.
Of course, what I just described are different types of cooperative agents, where they're trying to work together to reach a goal. But multi-agent systems could also be adversarial, where the entities are trying to maximize their own personal benefit while trying to minimize their opponents'. This is the case for games and other types of competitions. Also, a multi-agent system could be a mix of both adversarial and cooperative agents.
Now, one way to design a multi-agent system is that we, as humans, determine ahead of time what we want the agents to do in different situations, and then we impart that knowledge directly into the agent. For example, for autonomous vehicles, we could write code that already understands that it needs to stop at red lights and that it should maintain a minimum distance from the car in front of it and that it should stay between the road lines. So basically, through software and physical design and procedures, we are defining how the agent should act, given a particular environment state.
However, in contrast to explicitly encoding behaviors, we could also give the agent the ability to learn some or all of its behaviors on its own. And there are many different types of learning algorithms, like, for example, adaptive control and genetic algorithms. But for this video, we're going to be talking about reinforcement learning.
Now, we've covered reinforcement learning in several other Tech Talk videos that I've linked to below. So for this video, let me just quickly summarize what it is. An agent exists within an environment. The agent observes the environment state, and then the agent's policy automatically determines which action to take. This action affects the environment state, and there may be a reward granted based on the state-action pair, although in general this reward may be sparse and only received after many sequential actions.
And the goal of the reinforcement learning algorithm is to update the agent's policy over time in such a way as to maximize the reward. So the idea behind multi-agent reinforcement learning, or MARL, is that we have multiple agents interacting with an environment, and each of those agents is using some form of reinforcement learning to update its policy over time.
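To make that loop a little more concrete, here is a minimal sketch of single-agent reinforcement learning written as tabular Q-learning, which is one common algorithm that fits the description above. The environment interface (reset, step, and a list of actions) and the hyperparameter values are illustrative assumptions rather than anything specific to this video; in MARL, each agent would be running a loop like this at the same time.

```python
import random
from collections import defaultdict

def train(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)   # Q[(state, action)] -> estimated long-term reward

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # The policy: usually take the action with the highest estimated value,
            # but occasionally explore a random one.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Update the policy (here, a Q-table) so that, over time, it favors
            # actions that maximize the reward collected.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```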
Now, as soon as we have multiple agents that are each learning and interacting with each other, we start to introduce a few challenges that we have to overcome. And to give you a sense of just a few of those challenges, let's set up a toy example using a grid world. This room has three robotic vacuums, and we want them to vacuum the entire room in the shortest amount of time. That is, we want them to cover each of the white squares at least once.
And the dark squares represent the locations of walls and obstacles. So basically, the vacuums can't occupy these spaces. At each time step, each vacuum observes where it is in the room and where the obstacles are, and it also maintains a coverage map of where it has been.
So if we just look at the green vacuum, it has access to this information. Its position is in the corner, and it has just covered that spot so far. Now, using this information, the green vacuum determines one of five actions to take-- move up, down, left, or right, or just wait in its current spot until the next time step. And so, obviously, the idea would be that, for each time step, each vacuum moves to a square that hasn't been covered yet.
But the vacuums aren't programmed ahead of time with knowledge of which actions are the best to take. Instead, they use a reinforcement learning algorithm to learn the optimal policy over time, and they do this by maximizing rewards. A positive reward is given for moving to a previously unvacuumed cell, and a negative reward is given for an illegal action, like trying to drive into an obstacle or another robot.
Plus, there are small negative rewards for moving to a covered square and also for not moving at all. And then, finally, there is a very large reward if the entire room is covered. And here's the key part-- that reward is shared across every agent once the entire room is covered. So these agents are cooperating with each other, since they have this shared reward. Through many episodes of trying actions and collecting rewards, the goal is that they will learn how to vacuum this room efficiently.
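As a rough sketch, here is one way the per-step reward for a single vacuum could be computed to match that description. The specific numeric values and the function signature are placeholders; the exact rewards used in the MATLAB example aren't stated here.

```python
def vacuum_reward(target_cell, covered_cells, obstacle_cells,
                  other_vacuum_cells, moved, room_fully_covered):
    """One vacuum's reward for a single time step. All values are placeholders."""
    if target_cell in obstacle_cells or target_cell in other_vacuum_cells:
        reward = -10.0     # illegal action: trying to drive into a wall or another robot
    elif not moved:
        reward = -0.5      # small penalty for waiting in place
    elif target_cell in covered_cells:
        reward = -0.5      # small penalty for moving onto an already-vacuumed square
    else:
        reward = 1.0       # positive reward for covering a previously unvacuumed square
    if room_fully_covered:
        reward += 100.0    # large reward shared by every agent once the whole room is covered
    return reward
```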
So let's think about a few scenarios for how we can approach MARL. For the first, we're going to look at a decentralized architecture. In decentralized learning, each agent-- or each vacuum, in this case-- is trained independently from the others. In other words, each vacuum learns to cover as much of the room as it can without any regard to what the other vacuums are doing.
With a decentralized architecture, no information is shared between the agents. They are completely on their own for learning. And this has some benefits. For one, since no information is shared, these vacuums don't need to be designed to communicate with each other, which simplifies the overall system.
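Here is a minimal sketch of what decentralized learning could look like for the vacuums, assuming a multi-agent environment whose observations, actions, and rewards are keyed by agent name. That interface, and the tabular Q-learning update, are assumptions for illustration only; the key point is that each agent keeps its own Q-table and learns only from its own local observation.

```python
import random
from collections import defaultdict

def train_decentralized(env, agent_names, num_episodes=500,
                        alpha=0.1, gamma=0.99, epsilon=0.1):
    # One independent Q-table per vacuum; nothing is shared between agents.
    q_tables = {name: defaultdict(float) for name in agent_names}

    for _ in range(num_episodes):
        obs = env.reset()                       # {agent name: its own local observation}
        done = False
        while not done:
            actions = {}
            for name in agent_names:
                q = q_tables[name]
                if random.random() < epsilon:
                    actions[name] = random.choice(env.actions)
                else:
                    actions[name] = max(env.actions, key=lambda a: q[(obs[name], a)])

            next_obs, rewards, done = env.step(actions)

            # Each agent updates only its own policy, from only its own experience.
            for name in agent_names:
                q = q_tables[name]
                best_next = 0.0 if done else max(q[(next_obs[name], a)] for a in env.actions)
                td_error = rewards[name] + gamma * best_next - q[(obs[name], actions[name])]
                q[(obs[name], actions[name])] += alpha * td_error
            obs = next_obs
    return q_tables
```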
However, at least in this example, there are also a few obvious drawbacks to this method. The first is that, since the vacuums don't know which parts of the room the others have covered, they can't learn how to avoid those areas. For example, here the green robot is in a situation where it thinks there is an uncovered spot to its right and one to its left. And so if it takes an action to the right, it should receive a positive reward, since, as far as it knows, that square hasn't been covered yet.
However, if that spot was actually covered by a different vacuum, then it would be penalized instead. So it's going to want to learn to avoid this situation. However, the problem is that, since the green robot only has access to its own coverage map, it can't actually learn how to distinguish between an unvacuumed square and one that was covered by a different vacuum.
Now, one possible solution is to have each of the vacuums just share their coverage map with each other. That way, they will have enough information to ultimately learn to move into locations that haven't been covered by any of the vacuums. And again, this requires additional hardware or software that allows for this information sharing. So there is a trade-off that needs to be considered between performance and overall complexity of your project.
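A minimal sketch of that idea, with hypothetical function and data names: each vacuum's observation is built from the union of every vacuum's coverage map, so a square that was vacuumed by someone else no longer looks uncovered.

```python
def build_observation(own_position, coverage_maps):
    # coverage_maps: one set of covered (row, col) cells per vacuum.
    # Taking the union lets each vacuum distinguish "not covered by anyone"
    # from "not covered by me but covered by another vacuum".
    combined = set().union(*coverage_maps)
    return (own_position, frozenset(combined))
```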
So sharing the coverage map is definitely a better approach for this example than just having each vacuum work completely independently from each other. However, it still might not be the best approach because we actually have another problem. When we have multiple agents, each that are learning and changing their policy, this makes the environment nonstationary, meaning that the underlying Markov decision process changes over time. And the reason that this is an issue is because many reinforcement learning algorithms expect the environment to be stationary. Otherwise, they're chasing a moving target.
In the case of multi-agent reinforcement learning, we end up with a situation where each agent is trying to learn what the other agents are doing. But at the same time, those other agents are also learning and changing. So they're all continuously reacting to each other's policy changes, and they may never converge on a solution.
To get a better understanding of what I mean here, let's look at another scenario. Here we have two agents that are in this state. They are both surrounded by vacuumed areas, except for the center spot. Let's say that the green vacuum is expecting the red vacuum to just wait or to move in a different direction, and therefore, its policy has learned to try to occupy this space when the environment is in this state.
However, the red vacuum might have a similar expectation and also try to occupy the center space. This is an illegal action, and both agents would be punished with a negative reward. Now, if only one agent is learning, then, in the next episode, that agent will learn to wait when it's in this state since the other vacuum will cover that cell, and they're going to both receive a larger reward.
However, if both vacuums are learning, they both might not move in the next episode, since they expect the other to move instead. So this is a nonstationary environment. The agents can't keep up with what the other entities are going to do, since they are changing their policies at the same time. So we can end up with situations where the solution never converges.
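To see how this can play out, here is a toy version of that center-square situation: two agents each choose MOVE or WAIT, exactly one of them should move, and each one greedily updates its own value estimates as if the other's behavior were fixed. The payoffs and the learning rule are purely illustrative. Running it, both agents collide on the first step, then both back off and wait from then on, never discovering the coordinated solution where one moves and the other waits.

```python
MOVE, WAIT = 0, 1

def payoff(a1, a2):
    if a1 == MOVE and a2 == MOVE:
        return -1.0, -1.0      # both try to occupy the center square: illegal, both penalized
    if a1 == WAIT and a2 == WAIT:
        return -0.1, -0.1      # nobody moves, the square stays dirty: small penalty each
    return 1.0, 1.0            # exactly one moves: the square gets covered, shared reward

def simulate(rounds=10, alpha=0.5):
    q1 = [0.0, 0.0]            # agent 1's value estimates for MOVE and WAIT
    q2 = [0.0, 0.0]            # agent 2's value estimates for MOVE and WAIT
    names = {MOVE: "MOVE", WAIT: "WAIT"}
    for t in range(rounds):
        # Each agent greedily follows its own current estimates, implicitly
        # assuming the other agent's behavior is fixed.
        a1 = q1.index(max(q1))
        a2 = q2.index(max(q2))
        r1, r2 = payoff(a1, a2)
        # But both agents update at the same time, so each one's "environment"
        # keeps shifting underneath it.
        q1[a1] += alpha * (r1 - q1[a1])
        q2[a2] += alpha * (r2 - q2[a2])
        print(t, names[a1], names[a2], r1)

simulate()
```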
Now, to be fair, due to the stochastic nature of training, it is possible that, through many episodes, these two agents learn to work together to cover this one free square. Unfortunately, there is another inefficiency that arises from decentralized learning like this. And that is, even if the green and red vacuums learn how to cover the square in this state, with the green on the left and the red on the right, they still need to learn it again if their positions are swapped because their individual policies would see this as a new state and need to figure out what to do.
However, if these two agents are identical-- that is, they can make the same observations and take the same actions-- then there is no difference in how they should behave. The robot on the right should always move to the left, and the robot on the left should wait, regardless of which robot is where.
All right. So a decentralized approach is simpler from an implementation standpoint, which is beneficial in some MARL situations. But it introduces other problems, like the nonstationary environment. And for situations where a decentralized architecture doesn't work well, we can use a centralized architecture.
With a centralized architecture, there is some higher-level process that is collecting the experiences of the agents and then learning a policy using all of that information, which is then distributed back to the agents. And this is especially beneficial if each agent is identical in the sense that they have the same observations and actions, because now a single policy could be developed that would be optimal for each of them.
For example, each vacuum could observe its own position and coverage map and then store it in a common buffer. So the environment state is now the position of all three vacuums and their collective coverage map. And a centralized reinforcement learning algorithm would use that collective experience to come up with a policy that would move all three robots in the most beneficial way as a whole. In this way, it doesn't matter which robot is in which location, since they are interchangeable.
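Here is a minimal sketch of that idea for identical agents: a single shared Q-table is updated with the pooled experience of all of the vacuums, and each vacuum's observation includes its own position, everyone's positions, and the combined coverage map. The interface and observation layout are assumptions for illustration, not necessarily how the MATLAB example is implemented.

```python
import random
from collections import defaultdict

def train_shared_policy(env, agent_names, num_episodes=500,
                        alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)      # one policy shared by every agent

    for _ in range(num_episodes):
        obs = env.reset()       # {name: (own position, all positions, combined coverage map)}
        done = False
        while not done:
            actions = {}
            for name in agent_names:
                if random.random() < epsilon:
                    actions[name] = random.choice(env.actions)
                else:
                    actions[name] = max(env.actions, key=lambda a: q[(obs[name], a)])

            next_obs, rewards, done = env.step(actions)

            # Every agent's transition updates the same Q-table, so each robot
            # effectively learns from all of the others' experience.
            for name in agent_names:
                best_next = 0.0 if done else max(q[(next_obs[name], a)] for a in env.actions)
                td_error = rewards[name] + gamma * best_next - q[(obs[name], actions[name])]
                q[(obs[name], actions[name])] += alpha * td_error
            obs = next_obs
    return q
```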
And so we've effectively reduced the amount of learning that has to take place since each robot is basically learning from all of the other's experience. And not only that. We've created a stationary environment for our agents since all of the agents are treated as a larger entity and they know about the changes in each other's policies. And this removes that situation where the agents are chasing each other's changes and therefore provides a situation where the policy can eventually converge.
All right. Well, this was a very simple introduction to MARL. And we definitely didn't cover everything, but hopefully, you can start to appreciate the use cases for centralized and for decentralized learning, as well as some of the drawbacks of each architecture. And I always find that playing around with some examples is a good way to get a better feel for how an algorithm works.
So with that in mind, I've left some links below to several different MARL MATLAB examples. Specifically, you should check out the Reinforcement Learning Toolbox example called Train Multiple Agents for Area Coverage. I used a version of this example to illustrate the concepts in this video, and so it's a good place to start exploring multi-agent learning in more detail.
In this example, you can play around with both centralized and decentralized learning architectures and see the learning process in action. I think it's cool to watch how the agents change from episode to episode and how they converge on collecting the most reward as a whole. And not only can you change the learning strategy, but you can also play around with the learning hyperparameters and see if you can develop some intuition into how they affect the end result.
All right, so this is where I'm going to leave this video for now. If you don't want to miss any other future Tech Talk videos, don't forget to subscribe to this channel. And if you want to check out my channel, Control System Lectures, I cover more control-theory topics there as well. Thanks for watching, and I'll see you next time.