High fluctuation in Q0 value for TD3 agent while training.

6 views (last 30 days)
James Sorokhaibam
James Sorokhaibam on 12 May 2024
Answered: Ronit on 23 May 2024
I am training a TD3 RL agent for a pick-and-place robot. The reward function is reward = exp(-E/d), where E is the total energy consumed once the trajectory is complete and d is the distance of the object from the end-effector. Training went smoothly with a DQN agent, but it fails when DDPG or TD3 is used. What could be the reason for this? I used the following code for agent creation.
% Observation: 34-element continuous vector
obsInfo = rlNumericSpec([34 1]);
% Action: 14-element continuous vector bounded in [-1, 1]
actInfo = rlNumericSpec([14 1], ...
    LowerLimit=-1, ...
    UpperLimit=1);
% Custom environment defined by step and reset functions
env = rlFunctionEnv(obsInfo,actInfo,"KondoStepFunction","KondoResetFunction");
% TD3 agent with default actor and critic networks
agent = rlTD3Agent(obsInfo,actInfo);
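For reference, the reward computation inside KondoStepFunction reduces to the following line (E and d are obtained from the simulation):
% E: total energy consumed once the trajectory is complete
% d: distance of the object from the end-effector
reward = exp(-E/d);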

Answers (1)

Ronit
Ronit on 23 May 2024
Hello James,
To understand why there are high fluctuations with different RL agents, we first need to understand how these agents work.
  • The primary difference between DQN and agents like DDPG and TD3 is that DQN is a purely value-based method, whereas DDPG and TD3 use the actor-critic method.
  • The DQN network predicts the Q value for each state-action pair, so it is a single model. DDPG, on the other hand, has a critic model that estimates the Q value but uses an actor model to choose the action. Hence, DDPG tries to learn the policy directly, whereas DQN learns Q values that are then used to define the policy, generally an epsilon-greedy one (see the short sketch after this list).
  • Training an agent with DDPG or TD3 must therefore be done more carefully, not only because its learning is sometimes unstable, but also because the number of hyperparameters to fine-tune is roughly double that of DQN.
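For example, with the agent created in your code you can pull out the two kinds of approximators it carries. This is a minimal sketch using getActor and getCritic; the variable names are just illustrative:
actor   = getActor(agent);            % policy network: maps observation -> action
critics = getCritic(agent);           % Q-value critics; TD3 maintains two of them
agent.AgentOptions.ExplorationModel   % Gaussian noise added to actions for exploration
DQN, by contrast, carries only the single Q network, which is why it has far fewer knobs to tune.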
Here are a few suggestions that can help you get good results with TD3 or DDPG agents:
  1. Tune Hyperparameters: Adjust learning rates, replay buffer size, and exploration noise (a starting-point sketch follows this list).
  2. Normalize Rewards: Consider scaling your reward to reduce variability and improve learning stability (also illustrated in the sketch below).
  3. Monitor Training: Use diagnostics such as the Episode Q0 and episode reward curves in the Episode Manager to better understand action, reward, and learning dynamics.
Adjusting these aspects can help mitigate the high fluctuation and improve your TD3 agent's training performance.
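As a concrete starting point, here is a minimal sketch of how these settings can be passed in when creating the TD3 agent. It assumes R2022a or later (where rlOptimizerOptions and the agent's ActorOptimizerOptions/CriticOptimizerOptions are available), and the specific values are illustrative, not tuned for your robot:
% Optimizer settings: a smaller actor learning rate often helps stability
actorOpts  = rlOptimizerOptions(LearnRate=1e-4, GradientThreshold=1);
criticOpts = rlOptimizerOptions(LearnRate=1e-3, GradientThreshold=1);

agentOpts = rlTD3AgentOptions( ...
    MiniBatchSize=256, ...
    ExperienceBufferLength=1e6, ...              % replay buffer size
    DiscountFactor=0.99, ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts);

% Exploration noise: keep it modest and let it decay during training
agentOpts.ExplorationModel.StandardDeviation          = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;

agent = rlTD3Agent(obsInfo, actInfo, agentOpts);

% Reward scaling (inside your step function): a simple multiplicative factor
% on exp(-E/d) keeps the reward in a consistent, non-vanishing range
% rewardScale = 10;                              % illustrative value
% reward = rewardScale * exp(-E/d);
If training is still noisy, lowering the learning rates further or increasing MiniBatchSize are usually the first things to try.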
Hope this helps!
