Train DDPG Agent for Path-Following Control
This example shows how to train a deep deterministic policy gradient (DDPG) agent for path-following control (PFC) in Simulink®. For more information on DDPG agents, see Deep Deterministic Policy Gradient (DDPG) Agent (Reinforcement Learning Toolbox).
Fix Random Seed Generator to Improve Reproducibility
The example code may involve computation of random numbers at various stages such as initialization of the agent, creation of the actor and critic, resetting the environment during simulations, generating observations (for stochastic environments), generating exploration actions, and sampling mini-batches of experiences for learning. Fixing the random number stream preserves the sequence of the random numbers every time you run the code and improves reproducibility of results. You will fix the random number stream at various locations in the example.
Fix the random number stream with the seed 0 and random number algorithm Mersenne Twister. For more information on random number generation, see rng.
previousRngState = rng(0,"twister")
previousRngState = struct with fields:
Type: 'twister'
Seed: 0
State: [625x1 uint32]
The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.
Simulink Model
The reinforcement learning environment for this example consists of a simple bicycle model for the ego car together with a simple longitudinal model for the lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from the lead car by controlling longitudinal acceleration and braking, and to keep the ego car traveling along the centerline of its lane by controlling the front steering angle. For more information on PFC, see Path Following Control System (Model Predictive Control Toolbox). The ego car dynamics are specified by the following parameters.
m = 1600;    % total vehicle mass (kg)
Iz = 2875;   % yaw moment of inertia (mNs^2)
lf = 1.4;    % longitudinal distance from center of gravity to front tires (m)
lr = 1.6;    % longitudinal distance from center of gravity to rear tires (m)
Cf = 19000;  % cornering stiffness of front tires (N/rad)
Cr = 33000;  % cornering stiffness of rear tires (N/rad)
tau = 0.5;   % longitudinal time constant
Specify the initial position and velocity for the two vehicles.
x0_lead = 50;   % initial position for lead car (m)
v0_lead = 24;   % initial velocity for lead car (m/s)
x0_ego = 10;    % initial position for ego car (m)
v0_ego = 18;    % initial velocity for ego car (m/s)
Specify the standstill default spacing (m), time gap (s), and driver-set velocity (m/s).
D_default = 10;
t_gap = 1.4;
v_set = 28;
To simulate the physical limitations of the vehicle dynamics, constrain the acceleration to the range [–3, 2] (m/s^2) and the steering angle to the range [–0.2618, 0.2618] (rad), that is, –15 to 15 degrees.
amin_ego = -3;        % minimum acceleration (m/s^2)
amax_ego = 2;         % maximum acceleration (m/s^2)
umin_ego = -0.2618;   % -15 deg
umax_ego = 0.2618;    % +15 deg
The curvature of the road is defined by a constant, 0.001 (1/m). The initial value for the lateral deviation is 0.2 m and the initial value for the relative yaw angle is –0.1 rad.
rho = 0.001;         % road curvature (1/m)
e1_initial = 0.2;    % initial lateral deviation (m)
e2_initial = -0.1;   % initial relative yaw angle (rad)
Define the sample time Ts and simulation duration Tf in seconds.
Ts = 0.1;
Tf = 60;
Open the model.
mdl = "rlPFCMdl"; open_system(mdl) agentblk = mdl + "/RL Agent";
For this model:
The action signal consists of acceleration and steering angle actions. The acceleration action signal takes a value between –3 and 2 (m/s^2). The steering action signal takes a value between –15 degrees (–0.2618 rad) and 15 degrees (0.2618 rad).
The reference velocity for the ego car is defined as follows. If the relative distance is less than the safe distance, the ego car tracks the minimum of the lead car velocity and the driver-set velocity. In this manner, the ego car maintains some distance from the lead car. If the relative distance is greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe distance is defined as a linear function of the ego car longitudinal velocity $V_{ego}$, that is, $D_{safe} = D_{default} + t_{gap}\,V_{ego}$. The safe distance determines the tracking velocity for the ego car. A sketch of this logic, together with the reward terms, follows this list.
The observations from the environment contain the longitudinal measurements: the velocity error $e_V$, its integral $\int e_V$, and the ego car longitudinal velocity $V_{ego}$. In addition, the observations contain the lateral measurements: the lateral deviation $e_1$, the relative yaw angle $e_2$, their derivatives $\dot{e}_1$ and $\dot{e}_2$, and their integrals $\int e_1$ and $\int e_2$.
The simulation terminates when the lateral deviation $|e_1| > 1$, when the longitudinal velocity $V_{ego} < 0.5$, or when the relative distance between the lead car and the ego car $D_{rel} < 0$.
The reward $r_t$, provided at every time step $t$, is

$r_t = -\left(100\,e_1^2 + 500\,u_{t-1}^2\right)\times 10^{-3} - 10\,F_t + 2\,H_t - \left(10\,e_V^2 + 100\,a_{t-1}^2\right)\times 10^{-3} + M_t$

where $u_{t-1}$ is the steering input from the previous time step $t-1$ and $a_{t-1}$ is the acceleration input from the previous time step. The three logical values are as follows.

$F_t = 1$ if the simulation is terminated, otherwise $F_t = 0$

$H_t = 1$ if the lateral error $e_1^2 < 0.01$, otherwise $H_t = 0$

$M_t = 1$ if the velocity error $e_V^2 < 1$, otherwise $M_t = 0$
The three logical terms in the reward encourage the agent to make both the lateral error and the velocity error small, and at the same time penalize the agent if the simulation terminates early.
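The following sketch shows, for illustrative signal values, the reference-velocity logic and the reward terms described above. It is not the implementation inside the rlPFCMdl model; the variable names are hypothetical and the reward coefficients are those given in the reward definition above.

% Reference-velocity logic (illustrative values; not the rlPFCMdl implementation)
vEgo  = 20;                         % ego longitudinal velocity (m/s)
vLead = 24;                         % lead car velocity (m/s)
dRel  = 45;                         % relative distance (m)
dSafe = D_default + t_gap*vEgo;     % safe distance (m)
if dRel < dSafe
    vRef = min(vLead,v_set);        % follow the slower of lead and set velocity
else
    vRef = v_set;                   % track the driver-set velocity
end

% Reward terms for one time step (illustrative signal values)
e1 = 0.05;  eV = 0.8;  u = 0.01;  a = 0.3;  isDone = false;
F = double(isDone);                 % 1 if the simulation terminated early
H = double(e1^2 < 0.01);            % 1 if the lateral error is small
M = double(eV^2 < 1);               % 1 if the velocity error is small
r = -(100*e1^2 + 500*u^2)*1e-3 - 10*F + 2*H ...
    - (10*eV^2 + 100*a^2)*1e-3 + M;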
Create Environment Object
Create an environment interface for the Simulink model.
Create the observation specification.
obsInfo = rlNumericSpec([9 1], ...
    LowerLimit=-inf*ones(9,1), ...
    UpperLimit=inf*ones(9,1));
obsInfo.Name = "observations";
Create the action specification.
actInfo = rlNumericSpec([2 1], ...
    LowerLimit=[-3;-0.2618], ...
    UpperLimit=[2;0.2618]);
actInfo.Name = "accel;steer";
Create the environment interface.
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
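As an optional sanity check (not part of the original example), you can query the specifications stored in the environment object to confirm that its channels match obsInfo and actInfo.

getObservationInfo(env)
getActionInfo(env)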
To define the initial conditions, specify an environment reset function using a function handle. The reset function localResetFcn, which is defined at the end of the example, randomizes the initial position of the lead car, the lateral deviation, and the relative yaw angle.
env.ResetFcn = @localResetFcn;
Create DDPG Agent With Custom Networks
DDPG agents use a parametrized Q-value function approximator to estimate the value of the policy. A Q-value function critic takes the current observation and an action as inputs and returns a single scalar as output (the estimated discounted cumulative long-term reward given the action from the state corresponding to the current observation, and following the policy thereafter).
To model the parametrized Q-value function within the critic, use a neural network with two input layers (one for the observation channel, as specified by obsInfo, and the other for the action channel, as specified by actInfo) and one output layer (which returns the scalar value). Note that prod(obsInfo.Dimension) and prod(actInfo.Dimension) return the number of dimensions of the observation and action spaces, respectively, regardless of whether they are arranged as row vectors, column vectors, or matrices.
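For instance, with the specifications created above these products evaluate to 9 and 2 (a quick check, not part of the original example).

numObs = prod(obsInfo.Dimension)   % 9, observation is a 9-by-1 vector
numAct = prod(actInfo.Dimension)   % 2, action is a 2-by-1 vector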
When you create the networks, the initial parameters of the networks are initialized with random values. Fix the random number stream so that the agent is always initialized with the same parameter values.
rng(0,"twister");
Define each network path as an array of layer objects, and assign names to the input and output layers of each path. These names allow you to connect the paths and then later explicitly associate the network input and output layers with the appropriate environment channel.
% Number of neurons
L = 100;

% Main path
mainPath = [
    featureInputLayer(prod(obsInfo.Dimension),Name="obsInLyr")
    fullyConnectedLayer(L)
    reluLayer
    fullyConnectedLayer(L)
    additionLayer(2,Name="add")
    reluLayer
    fullyConnectedLayer(L)
    reluLayer
    fullyConnectedLayer(1,Name="QValLyr")
    ];

% Action path
actionPath = [
    featureInputLayer(prod(actInfo.Dimension),Name="actInLyr")
    fullyConnectedLayer(L,Name="actOutLyr")
    ];
Assemble the dlnetwork object.
criticNet = dlnetwork();
criticNet = addLayers(criticNet,mainPath);
criticNet = addLayers(criticNet,actionPath);
criticNet = connectLayers(criticNet,"actOutLyr","add/in2");
Initialize the dlnetwork object and display the number of weights.
criticNet = initialize(criticNet);
summary(criticNet)
Initialized: true

Number of learnables: 21.6k

Inputs:
   1   'obsInLyr'   9 features
   2   'actInLyr'   2 features
View the critic network configuration.
plot(criticNet)
Create the critic using the specified neural network and the environment action and observation specifications. Also pass the names of the network layers to be connected with the observation and action channels as additional arguments. For more information, see rlQValueFunction (Reinforcement Learning Toolbox).
critic = rlQValueFunction(criticNet,obsInfo,actInfo,...
    ObservationInputNames="obsInLyr",ActionInputNames="actInLyr");
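As an optional check (not part of the original example), you can evaluate the untrained critic for a random observation-action pair. The getValue function returns the estimated Q-value as a scalar.

% Q-value of a random observation-action pair from the untrained critic
q0 = getValue(critic,{rand(obsInfo.Dimension)},{rand(actInfo.Dimension)})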
DDPG agents use a parametrized deterministic policy over continuous action spaces, which is learned by a continuous deterministic actor. This actor takes the current observation as input and returns as output an action that is a deterministic function of the observation.
To model the parametrized policy within the actor, use a neural network with one input layer (which receives the content of the environment observation channel, as specified by obsInfo) and one output layer (which returns the action to the environment action channel, as specified by actInfo).
Define the network as an array of layer objects.
actorNet = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(L)
    reluLayer
    fullyConnectedLayer(L)
    reluLayer
    fullyConnectedLayer(L)
    reluLayer
    fullyConnectedLayer(2)
    tanhLayer
    scalingLayer(Scale=[2.5;0.2618],Bias=[-0.5;0])
    ];
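The tanhLayer bounds each of the two network outputs to [–1, 1], and the scalingLayer (which computes Scale.*X + Bias) maps these bounds to the action limits: 2.5·[–1, 1] – 0.5 = [–3, 2] m/s^2 for acceleration and 0.2618·[–1, 1] = [–0.2618, 0.2618] rad for steering. A quick check of the resulting limits (not part of the original example):

% Map the tanh output range [-1,1] through the scaling layer parameters
scale = [2.5;0.2618];
bias  = [-0.5;0];
lowerLimits = scale*(-1) + bias   % [-3; -0.2618]
upperLimits = scale*( 1) + bias   % [ 2;  0.2618]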
Convert to a dlnetwork object, initialize the network, and display the number of weights.
actorNet = dlnetwork(actorNet);
actorNet = initialize(actorNet);
summary(actorNet)
Initialized: true

Number of learnables: 21.4k

Inputs:
   1   'input'   9 features
Construct the actor similarly to the critic. For more information, see rlContinuousDeterministicActor (Reinforcement Learning Toolbox).
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);
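As with the critic, you can optionally verify the input-output behavior of the untrained actor (not part of the original example). The returned action is a 2-by-1 vector within the limits enforced by the scaling layer.

% Action returned by the untrained actor for a random observation
a0 = getAction(actor,{rand(obsInfo.Dimension)});
a0{1}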
Specify training options for the critic and the actor using rlOptimizerOptions (Reinforcement Learning Toolbox).
- Use learning rates of 1e-4 and 1e-3 for the actor and critic, respectively. A smaller learning rate can stabilize the training but at the cost of increased training time.
- Clip gradients using a gradient threshold value of 1. This can prevent instability in learning due to large gradient values.
- Use an L2 regularization factor of 1e-4 to further stabilize the learning process.
criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-4);

actorOptions = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-4);
Specify the DDPG agent options using rlDDPGAgentOptions (Reinforcement Learning Toolbox).
- Specify the experience buffer length 1e6. A larger buffer can store a diverse set of experiences.
- Specify the initial standard deviation of [0.6;0.1] for the two actions, and a decay rate of 1e-5 for the standard deviation. The decay facilitates exploration towards the beginning and exploitation towards the end of the training. A sketch of this decay follows the code below.
- Specify the sample time Ts for the agent.
agentOptions = rlDDPGAgentOptions(...
    SampleTime=Ts,...
    ActorOptimizerOptions=actorOptions,...
    CriticOptimizerOptions=criticOptions,...
    ExperienceBufferLength=1e6);

agentOptions.NoiseOptions.StandardDeviation = [0.6;0.1];
agentOptions.NoiseOptions.StandardDeviationDecayRate = 1e-5;
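As a rough guide to the exploration schedule (a sketch, assuming the noise standard deviation decays geometrically by a factor of (1 – decay rate) at every agent sample time), you can estimate how many agent steps it takes for the exploration noise to halve.

% Approximate number of agent steps for the noise standard deviation to halve
decayRate = agentOptions.NoiseOptions.StandardDeviationDecayRate;
stepsToHalve = log(0.5)/log(1 - decayRate)   % roughly 6.9e4 steps
% With Ts = 0.1 s and at most 600 steps per episode, the noise halves after
% roughly 115 episodes, keeping exploration high early in training.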
Create the DDPG agent using the actor, the critic, and the agent options. For more information, see rlDDPGAgent (Reinforcement Learning Toolbox).
agent = rlDDPGAgent(actor,critic,agentOptions);
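You can optionally check the untrained agent before training (not part of the original example); it returns a two-element action (acceleration and steering) for any 9-element observation.

% Action returned by the untrained agent for a random observation
act0 = getAction(agent,{rand(obsInfo.Dimension)});
act0{1}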
Train Agent
To train the agent, first specify the training options. For this example, use the following options:
- Run the training for at most 10000 episodes, with each episode lasting at most maxsteps time steps.
- Display the training progress in the Reinforcement Learning Training Monitor dialog box (set the Verbose and Plots options).
- Evaluate the performance of the greedy policy every 50 training episodes, averaging the cumulative reward of 5 simulations.
- Stop training when the evaluation score reaches or exceeds 1600.
For more information, see rlTrainingOptions (Reinforcement Learning Toolbox).
maxepisodes = 1e4;
maxsteps = ceil(Tf/Ts);

trainingOpts = rlTrainingOptions(...
    MaxEpisodes=maxepisodes,...
    MaxStepsPerEpisode=maxsteps,...
    Verbose=false,...
    Plots="training-progress",...
    StopTrainingCriteria="EvaluationStatistic",...
    StopTrainingValue=1600);

% Agent evaluator
evl = rlEvaluator(EvaluationFrequency=50, NumEpisodes=5);
Fix the random stream for reproducibility.
rng(0,"twister");
Train the agent using the train (Reinforcement Learning Toolbox) function. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainingOpts,Evaluator=evl);
else
    % Load a pretrained agent for the example.
    load("SimulinkPFCDDPG.mat","agent")
end
Simulate DDPG Agent
Fix the random stream for reproducibility.
rng(0,"twister");
To demonstrate the trained agent using deterministic initial conditions, simulate the model in Simulink. To validate the performance of the trained agent with random initial conditions, set userSpecifiedConditions to false.
userSpecifiedConditions = true;

if userSpecifiedConditions
    e1_initial = -0.4;
    e2_initial = 0.1;
    x0_lead = 80;
    sim(mdl);
else
    simOptions = rlSimulationOptions(MaxSteps=maxsteps);
    experience = sim(env,agent,simOptions);
end
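When you simulate through the environment interface (the else branch above), the experience output logs the episode signals. Assuming the standard output structure of sim, in which Reward is a timeseries, you could, for example, compute the total episode reward.

% Total reward of the simulated episode (only available in the else branch)
if ~userSpecifiedConditions
    totalReward = sum(experience.Reward.Data)
end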
For more information on agent simulation, see rlSimulationOptions (Reinforcement Learning Toolbox) and sim (Reinforcement Learning Toolbox).
The following plots show the simulation results when the lead car is 70 (m) ahead of the ego car.
In the first 35 seconds, the relative distance is greater than the safe distance (bottom-right plot), so the ego car tracks the set velocity (top-right plot). To speed up and reach the set velocity, the acceleration is mostly nonnegative (top-left plot).
From 35 to 42 seconds, the relative distance is mostly less than the safe distance (bottom-right plot), so the ego car tracks the minimum of the lead velocity and set velocity. Because the lead velocity is less than the set velocity (top-right plot), to track the lead velocity the acceleration becomes mostly negative (top-left plot).
From 42 to 58 seconds, the ego car tracks the set velocity (top-right plot) and the acceleration remains zero (top-left plot).
From 58 to 60 seconds, the relative distance becomes less than the safe distance (bottom-right plot), so the ego car slows down and tracks the lead velocity.
The bottom-left plot shows the lateral deviation. As shown in the plot, the lateral deviation is greatly decreased within 1 second. The lateral deviation remains less than 0.05 m.
Restore the random number stream using the information stored in previousRngState.
rng(previousRngState);
Environment Reset Function
function in = localResetFcn(in)
    % Random value for initial position of the lead car
    in = setVariable(in,"x0_lead",40+randi(60,1,1));

    % Random value for lateral deviation
    in = setVariable(in,"e1_initial",0.5*(-1+2*rand));

    % Random value for relative yaw angle
    in = setVariable(in,"e2_initial",0.1*(-1+2*rand));
end
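As an optional command-line check (not part of the original example), you can apply the reset function to a Simulink.SimulationInput object and inspect the randomized variables it sets.

% Apply the reset function once and list the variables it overrides
in = Simulink.SimulationInput(mdl);
in = localResetFcn(in);
in.Variables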
See Also
train (Reinforcement Learning Toolbox)
Related Topics
- Train Reinforcement Learning Agents (Reinforcement Learning Toolbox)
- Create Policies and Value Functions (Reinforcement Learning Toolbox)