How can I design my DQN network so that episode Q0 approaches the episode reward?

10 views (last 30 days)
I'm using the reinforcement learning toolbox to design and train a DQN agent.
My action space is discrete, with 24 elements. The network is a convolutional neural network inspired by the dueling DQN architecture; these are my convolutional layers:
Layers = [
    imageInputLayer([h w 5],"Name","imageinput","Normalization","none")
    convolution2dLayer([8 8],64,"Name","conv_1","Padding","same","Stride",[4 4])
    reluLayer("Name","relu1")
    convolution2dLayer([8 8],64,"Name","conv_1.2","Padding","same")
    reluLayer("Name","relu1.2")
    convolution2dLayer([4 4],128,"Name","conv_2","Padding","same","Stride",[2 2])
    reluLayer("Name","relu2")
    convolution2dLayer([4 4],128,"Name","conv_2.2","Padding","same")
    reluLayer("Name","relu2.2")
    convolution2dLayer([4 4],128,"Name","conv_2.3","Padding","same")
    reluLayer("Name","relu2.3")
    convolution2dLayer([3 3],256,"Name","conv_3","Padding","same","Stride",[2 2])
    reluLayer("Name","relu3")
    convolution2dLayer([3 3],256,"Name","conv_3.2","Padding","same")
    reluLayer("Name","relu3.2")
    convolution2dLayer([3 3],256,"Name","conv_3.3","Padding","same")
    reluLayer("Name","relu3.3")];
I'm getting poor results during training: the reward keeps oscillating in the same range of values without improving, as if the agent were not learning. Moreover, at some point the episode Q0 diverges drastically. I found this in the documentation: "For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward, as shown in the preceding figure."
Therefore my question is the following: how can I modify my network so that Q0 approaches the episode reward values? Might there be other problems?
The parameters I'm using for the agent are the following:
criticOpts.Optimizer = 'adam';
criticOpts.LearnRate = 0.00025;
agentOpts.UseDoubleDQN = true;
agentOpts.ExperienceBufferLength = 1e6;
agentOpts.NumStepsToLookAhead = 1;
agentOpts.DiscountFactor = 0.99;
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
agentOpts.MiniBatchSize = 64;
agentOpts.TargetUpdateMethod = 'smoothing';
agentOpts.TargetUpdateFrequency = 1;
agentOpts.TargetSmoothFactor = 1e-3;
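For context, here is a minimal sketch of how these options would typically be wired into the agent (obsInfo and actInfo come from the environment, and criticNetwork stands for the complete critic network built from the layers above; those names and the exact constructor calls are assumptions about my setup rather than a fixed recipe):
criticOpts = rlRepresentationOptions('Optimizer','adam','LearnRate',0.00025);
% Multi-output Q-value critic: one Q-value per discrete action
critic = rlQValueRepresentation(criticNetwork, obsInfo, actInfo, ...
    'Observation',{'imageinput'}, criticOpts);
agentOpts = rlDQNAgentOptions( ...
    'UseDoubleDQN',true, ...
    'ExperienceBufferLength',1e6, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',64, ...
    'TargetSmoothFactor',1e-3, ...
    'TargetUpdateFrequency',1);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
agent = rlDQNAgent(critic, agentOpts);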

Accepted Answer

Emmanouil Tzorakoleftherakis on 5 Mar 2021
There is no single answer here that will get the training to work. My first instinct would be to go for a simpler architecture without convolutional layers, get some result that makes sense (so that you get an idea of which hyperparameters are working), and then move to the dueling DQN architecture.
You would still need to experiment with hyperparams though. First off, reduce the epsilon decay rate to let the agent explore more, and then play with experience buffer length and mini-batch size (the other params can be left at their default values initially).
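For example, a minimal sketch of slowing the decay (the exact value is an assumption and should be tuned to the number of training steps):
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-4;  % much slower than 0.01, so the agent keeps exploring longer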
A couple of things to keep in mind: 1) As of the R2020b release, the default agent feature lets you create an agent just by providing the observation and action info (so there is no need to create the neural network architecture yourself). Take a look here. 2) If you still want to create your own network, make sure you use the multi-output critic architecture (one Q-value output per action); see the sketch below.
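As a rough illustration of point 2, a minimal multi-output critic for a 24-action discrete space could look like this (layer sizes and names are placeholders, not a recommendation; obsInfo, actInfo, h, w and criticOpts as in the question):
numActions = 24;
simpleLayers = [
    imageInputLayer([h w 5],'Name','imageinput','Normalization','none')
    convolution2dLayer([8 8],32,'Name','conv1','Stride',[4 4])
    reluLayer('Name','relu1')
    fullyConnectedLayer(256,'Name','fc1')
    reluLayer('Name','relu2')
    fullyConnectedLayer(numActions,'Name','qvalues')];  % one Q-value per action
critic = rlQValueRepresentation(simpleLayers, obsInfo, actInfo, ...
    'Observation',{'imageinput'}, criticOpts);
% Or, as of R2020b, skip the hand-built network entirely and use a default agent:
% agent = rlDQNAgent(obsInfo, actInfo);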
When you get to the point where you can move to dueling DQN, I would start with some architectures published in papers. For example, at first glance the architecture you are showing seems to have a lot of conv layers - did you see this in some paper? This paper, which talks about the same topic, may be a good place to start.
Hope this helps
4 comments
Matteo Padovani on 6 Mar 2021
It was really helpful, since the documentation was not clear to me.
I have one last question. I thought of using the architecture you suggested, but there is the issue of summing the value estimate with the advantage values to obtain the Q-values, since the network has to output them in order to define the agent. Is the only way of doing that to create a custom layer that performs the averaged summation?
And again, thanks a lot.
Emmanouil Tzorakoleftherakis on 6 Mar 2021
You can use the additionLayer - here is an example that shows how to use it to create a critic. As I mentioned, I would start with something simple that does not consider the advantage estimation.
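For reference, here is a hedged sketch of that idea (this is not the linked example; the layer names and sizes are assumptions, the value stream is widened to numActions outputs so both inputs of additionLayer match in size, and there is no mean-subtraction as in the original dueling paper):
numActions = 24;
commonPath = [
    imageInputLayer([h w 5],'Name','imageinput','Normalization','none')
    convolution2dLayer([8 8],32,'Name','conv1','Stride',[4 4])
    reluLayer('Name','relu1')
    fullyConnectedLayer(256,'Name','fc_common')
    reluLayer('Name','relu_common')];
advantagePath = fullyConnectedLayer(numActions,'Name','fc_advantage');  % per-action advantage stream
valuePath = fullyConnectedLayer(numActions,'Name','fc_value');          % value stream, sized to match
lgraph = layerGraph(commonPath);
lgraph = addLayers(lgraph, advantagePath);
lgraph = addLayers(lgraph, valuePath);
lgraph = addLayers(lgraph, additionLayer(2,'Name','qvalues'));  % Q = value + advantage
lgraph = connectLayers(lgraph,'relu_common','fc_advantage');
lgraph = connectLayers(lgraph,'relu_common','fc_value');
lgraph = connectLayers(lgraph,'fc_advantage','qvalues/in1');
lgraph = connectLayers(lgraph,'fc_value','qvalues/in2');
critic = rlQValueRepresentation(lgraph, obsInfo, actInfo, ...
    'Observation',{'imageinput'}, criticOpts);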


More Answers (0)

Release

R2020b
