RL DQN agent Episode Q0 does not converge to Average Reward

18 views (last 30 days)
Amin Moradi
Amin Moradi on 24 Feb 2022
Answered: Ronit on 16 Feb 2024
I'm using the Reinforcement Learning Toolbox in MATLAB R2021b to train a DQN agent. After choosing an appropriate discount factor and other parameters, the average reward looks good and correct, but Episode Q0 does not converge to the average reward. I have attached the training results. I would be grateful if someone could help me correct this or point out possible reasons for this behaviour. Here is my training code; you can see the training parameters in it:
ObservationInfo = rlNumericSpec([1 11]);
ObservationInfo.Name = 'Line State';
ObservationInfo.Description = 'line1, line2, line3, line4, line5, line6, line7, line8, line9, line10, line11';
ObservationInfo.LowerLimit=0;
ObservationInfo.UpperLimit=1;
ActionInfo = rlFiniteSetSpec([1 2 3 4 5 6 7 8 9 10 11]);
ActionInfo.Name = 'Attacker Action';
ActionInfo.Description = ['attack-line1, attack-line2, attack-line3, attack-line4, ' ...
'attack-line5, attack-line6, attack-line7, attack-line8, attack-line9, attack-line10, attack-line11'];
env = rlFunctionEnv(ObservationInfo, ActionInfo,'WW6_StepFunction_genloss','WW6_ResetFunction');
dnn = [
    featureInputLayer(ObservationInfo.Dimension(2),'Normalization','none','Name','state')
    fullyConnectedLayer(120,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(120,'Name','CriticStateFC2')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(length(ActionInfo.Elements),'Name','output')];
criticOpts = rlRepresentationOptions('LearnRate',0.001,'GradientThreshold',1);
critic = rlQValueRepresentation(dnn,ObservationInfo,ActionInfo,'Observation',{'state'},criticOpts);
agentOpts = rlDQNAgentOptions(...
    'NumStepsToLookAhead',1,... % used for parallel computing
    'UseDoubleDQN',true, ...
    'TargetSmoothFactor',1e-1, ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',100000, ...
    'DiscountFactor',0.7, ...
    'MiniBatchSize',256 ...
    );
agentOpts.EpsilonGreedyExploration.Epsilon=1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay=0.005;
agentOpts.EpsilonGreedyExploration.EpsilonMin=0.1;
agent = rlDQNAgent(critic,agentOpts);
trainOpts = rlTrainingOptions(...
    'UseParallel',true,... % used for parallel computing
    'MaxEpisodes',8000, ...
    'MaxStepsPerEpisode',5, ...
    'Verbose',false, ...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',900);
trainOpts.ScoreAveragingWindowLength=20;
trainingStats = train(agent,env,trainOpts);

Answers (1)

Ronit
Ronit on 16 Feb 2024
Hi,
I understand your concern about Episode Q0 not converging to the average reward. It is important to recognize that if your model is already producing good results, the behaviour of Episode Q0 may not be a significant issue.
For more details, you can refer to related community answers on this topic.
Remember that reinforcement learning can be sensitive to hyperparameter settings and often requires a lot of trial and error to find the right combination for a given problem. Should you decide to bring Episode Q0 closer to the average-reward curve, here are some adjustments you might consider (a rough sketch of such adjustments follows the list):
  • Epsilon Decay Rate: Adjust the epsilon decay rate to ensure enough exploration throughout the training.
  • Learning rate: Experiment with different learning rates.
  • Discount Factor: Adjust the discount factor to better balance immediate and future rewards.
  • Target Network Update Frequency: Change the target network update frequency to improve stability.
  • Episodes: Increase the number of episodes or steps per episode.
  • Reward Function: Review and possibly redesign the reward function.
  • Step and Reset Functions: Check the implementation of your environment's step and reset functions for potential issues.
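As a rough starting point, here is a minimal sketch of how these knobs could be adjusted in the script from the question. All of the specific values (slower epsilon decay, smaller target smoothing factor, higher discount factor, lower learning rate, more episodes) are illustrative assumptions to experiment with, not recommended settings:
% Illustrative re-tuning of the agent and training options; values are assumptions.
criticOpts = rlRepresentationOptions('LearnRate',5e-4,'GradientThreshold',1);  % try different learning rates
agentOpts = rlDQNAgentOptions( ...
    'UseDoubleDQN',true, ...
    'TargetSmoothFactor',1e-3, ...       % smaller factor = smoother target updates
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',100000, ...
    'DiscountFactor',0.9, ...            % weigh future rewards more heavily
    'MiniBatchSize',256);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;   % slower decay keeps exploration longer
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.05;
trainOpts.MaxEpisodes = 16000;           % more episodes, since each episode is only 5 steps
The general idea is to keep exploration high for longer and make target-network updates smoother, which tends to stabilise the critic's Q0 estimate.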
You can also use Bayesian optimization, provided in MATLAB through the 'bayesopt' function (Statistics and Machine Learning Toolbox). It is an efficient method for global optimization of black-box functions and can be used to tune the hyperparameters of an RL agent.
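As an example, a minimal sketch of such a search is shown below. The variable ranges, the number of evaluations, and the helper trainAgentOnce (which would rebuild the agent with the candidate hyperparameters, call train, and return the final average reward) are all hypothetical and only illustrate the pattern:
% Hypothetical hyperparameter search with bayesopt; ranges and helper are illustrative.
vars = [optimizableVariable('LearnRate',[1e-4 1e-2],'Transform','log')
        optimizableVariable('Discount',[0.5 0.99])
        optimizableVariable('EpsilonDecay',[1e-4 1e-2],'Transform','log')];
% bayesopt minimizes, so return the negative of the average reward achieved.
objFcn = @(x) -trainAgentOnce(x.LearnRate, x.Discount, x.EpsilonDecay);
results = bayesopt(objFcn, vars, ...
    'MaxObjectiveEvaluations',20, ...
    'IsObjectiveDeterministic',false);   % RL training runs are stochastic
best = bestPoint(results);               % hyperparameters for a final training run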
Hope this helps!
Ronit Jain
