RL: Continuous action space, but within a desired range (use PPO)

Question

Peijie on 26 Oct 2023

Direct link to this question

https://la.mathworks.com/matlabcentral/answers/2038971-rl-continuous-action-space-but-within-a-desired-range-use-ppo

Answered: Nicolas CRETIN on 10 Aug 2024

I am trying to train a PPO agent with a continuous action space.

However, I want to ensure that the output of my actor always stays within the upper and lower bounds I set. My environment uses the following code, and my actor and critic networks are shown below.

% observation info
ObservationInfo = rlNumericSpec([n_Pd+n_Pg+1, 1]);
% action info
ActionInfo = rlNumericSpec([n_Pg, 1], ...
    'LowerLimit', Pgmin, ...
    'UpperLimit', Pgmax);

Actor network

%% Actor Network
% Input path layers
inPath = [featureInputLayer(numObservations,'Normalization','none','Name','observation')
        fullyConnectedLayer(128,'Name','ActorFC1')
        reluLayer('Name','ActorRelu1')
        fullyConnectedLayer(128,'Name','ActorFC2')
        reluLayer('Name', 'ActorRelu2')
        fullyConnectedLayer(numActions,'Name','Action')
         ];
% Path layers for mean value 
meanPath = [ 
    tanhLayer(Name="tanhMean");
    fullyConnectedLayer(numActions);
    scalingLayer('Name','ActorScaling','Scale',actInfo.UpperLimit) 
    ];
% Path layers for standard deviations
% Using softplus layer to make them non negative
sdevPath = [ 
    tanhLayer(Name="tanhStdv");
    fullyConnectedLayer(numActions);
    softplusLayer(Name="Splus") 
    ];
% Add layers to network object
actorNetwork = layerGraph(inPath);
actorNetwork = addLayers(actorNetwork,meanPath);
actorNetwork = addLayers(actorNetwork,sdevPath);
% Connect layers
actorNetwork = connectLayers(actorNetwork,"Action","tanhMean/in");
actorNetwork = connectLayers(actorNetwork,"Action","tanhStdv/in");
actorNetwork = dlnetwork(actorNetwork);
% figure(2)
% plot(layerGraph(actorNetwork))
% Setting Actor
actorOptions = rlOptimizerOptions('LearnRate',0.1,'GradientThreshold',inf);
actor = rlContinuousGaussianActor(actorNetwork,obsInfo,actInfo, ...
    "ActionMeanOutputNames","ActorScaling", ...
    "ActionStandardDeviationOutputNames","Splus");

Critic network

%% Critic Network
criticNetwork = [
        featureInputLayer(numObservations,'Normalization','none','Name','observation')
        fullyConnectedLayer(128,'Name','CriticFC1')
        reluLayer('Name','CriticRelu1')
        fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = dlnetwork(criticNetwork);
% Setting Critic
criticOptions = rlOptimizerOptions('LearnRate',0.1,'GradientThreshold',inf);
critic = rlValueFunction(criticNetwork,obsInfo);

Something else

%% Create PPO Agent
% Setting PPO Agent Options
agentOptions = rlPPOAgentOptions(...
    'SampleTime',Ts,...
    'ActorOptimizerOptions',actorOptions,...
    'CriticOptimizerOptions',criticOptions,...
    'ExperienceHorizon',600,... 
    'ClipFactor',0.02,...
    'EntropyLossWeight',0.01,...
    'MiniBatchSize',300, ...
    'AdvantageEstimateMethod','gae',...
    'GAEFactor',0.95,...
    'DiscountFactor',0.99);
% Create Agent
agent = rlPPOAgent(actor,critic,agentOptions);
%% Train Agent
maxepisodes = 10000;
maxsteps = ceil(Nt/Ts);
trainingOptions = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'StopOnError',"on",...
    'Plots',"training-progress",...
    'StopTrainingCriteria',"AverageReward",...
    'StopTrainingValue',-14500,...
    'SaveAgentCriteria',"EpisodeReward",...
    'SaveAgentValue',-14500); 
% train? 1-train; 0-not train
doTraining = 1;
if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainingOptions);
    save('XXX.mat','agent')
else
    % Load the pretrained agent for the example.
    load('XXX.mat','agent')       
end

THANKS!


Answer 1

Emmanouil Tzorakoleftherakis on 27 Oct 2023

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/2038971-rl-continuous-action-space-but-within-a-desired-range-use-ppo#answer_1342436

You can always clip the agent output on the environment side. PPO is stochastic so the upper and lower limits are not guaranteed to be respected with the current implementation.
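A minimal sketch of what clipping on the environment side could look like inside a MATLAB step function. The numeric limits and the raw action below are hypothetical stand-ins; `Pgmin` and `Pgmax` play the role of the bounds from the question.

```matlab
% Hypothetical limits standing in for Pgmin/Pgmax from the question
Pgmin = [0; 10; 5];
Pgmax = [50; 80; 40];

% Raw action as it might be sampled by the PPO agent (may violate the limits)
rawAction = [-3.2; 95.0; 20.0];

% Saturate element-wise before the action is applied to the plant model
clippedAction = min(max(rawAction, Pgmin), Pgmax);

disp(clippedAction.')
```

The agent still sees its own unclipped action during learning, so the clip only protects the environment; it does not change what the actor outputs.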

2 comments

Peijie on 4 Jul 2024

Has this algorithm been updated in the 2024a version? Thank you.

Emmanouil Tzorakoleftherakis on 4 Jul 2024

Updated with respect to what? What I mentioned above is still true.


Answer 2

Nicolas CRETIN on 19 Jul 2024

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/2038971-rl-continuous-action-space-but-within-a-desired-range-use-ppo#answer_1488206

Hello,

I'm quite a beginner in this field, but I faced the same issue, and I worked around it with a sigmoid layer followed by a scaling layer.

The sigmoid layer outputs a value between 0 and 1, which you can then rescale with a linear function to the desired range. This is to be applied only to the action (mean) path of the actor network, unless you also want to scale your standard deviation.
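As a numeric illustration of that mapping, here is a small sketch in plain MATLAB. The bounds `lo` and `hi` and the test inputs are hypothetical. In a network, the same linear map would correspond to a scaling layer with `Scale = hi - lo` and `Bias = lo` placed after the sigmoid layer.

```matlab
% Hypothetical bounds for a single action
lo = 2;
hi = 12;

% Unbounded pre-activations, as a fully connected layer might produce
x = [-50 -1 0 1 50];

% Sigmoid squashes into the interval between 0 and 1
s = 1 ./ (1 + exp(-x));

% Linear rescaling into the interval between lo and hi
a = lo + (hi - lo) .* s;

disp(a)
```

However large the input gets, the rescaled value stays between `lo` and `hi`, and an input of 0 lands at the midpoint of the range.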

That said, I'm quite surprised that Emmanouil said there is no way to do it. Did I miss a side effect or something?

Hope it helps,

Regards,

Nicolas

1 comment

Peijie on 5 Aug 2024

Thank you very much! But I still have some issues after trying. Could I see the structure of your actor network? How do you write the sigmoid layer and scaling layer you mentioned? I would appreciate it immensely.

Thanks!


Answer 3

Nicolas CRETIN on 10 Aug 2024

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/2038971-rl-continuous-action-space-but-within-a-desired-range-use-ppo#answer_1497384

Hi Peijie,

I'll create a new answer!

As stated in the documentation (Proximal policy optimization (PPO) reinforcement learning agent - MATLAB - MathWorks France): "For continuous action spaces, this agent does not enforce constraints set in the action specification".

But, to bypass this issue, you can:

1) clip the agent output on the environment side, for example with a Saturation block (Limit input signal to the upper and lower saturation values - Simulink - MathWorks France), as Emmanouil said.

2) use a sigmoid or tanh layer to squash the output of your layer into the range ]0, 1[ or ]-1, 1[, and then rescale it to your desired range with a scaling layer.

In the case of a PPO agent, you have two output paths, which correspond to the mean value and the standard deviation of your action (the action is sampled from a Gaussian distribution, since it's a stochastic agent).

For example, if you want to scale your action output within the range ]-10, 10[, you can use a tanh layer followed by a scaling layer:

...
tanhLayer(Name="tanh") % scaled within ]-1, 1[
scalingLayer(Name="scaling",Scale=10,Bias=0) % rescale it within ]-10, 10[
... 

But keep in mind that the action of a PPO agent is stochastic, so if you don't bound the values of your standard deviation path, you will have no guarantee that your action stays within the desired range. Even with a bounded standard deviation, a Gaussian sample can still fall outside the range, only less often. Moreover, I believe that softplus layers are used as the default output for the standard deviation path, which will not work in your case because softplus has no upper bound: you can use a sigmoid instead, for example (which ensures your standard deviation is always positive, contrary to tanh).

By the way, tanh layers are supposed to be better than sigmoid in most cases, and softplus layers are usually not recommended (see: arXiv:2010.09458).

3) You should also consider using DDPG agents, since they are deterministic.
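For option 2), a possible sketch of the two actor output paths when the lower and upper limits are not symmetric around zero. The limits, the layer names, and the cap `sigmaMax` on the standard deviation are assumptions chosen for illustration; the sketch relies on `scalingLayer` computing `Scale.*input + Bias`. Note the order in the mean path: tanh is the last nonlinearity before the scaling layer. In the question's `meanPath`, a fully connected layer sits between the two, which would leave the mean unbounded.

```matlab
% Hypothetical per-action limits (stand-ins for Pgmin and Pgmax)
Pgmin = [0; 10];
Pgmax = [50; 80];
numActions = numel(Pgmin);

% tanh outputs lie in ]-1, 1[; map that interval onto ]Pgmin, Pgmax[
scale = (Pgmax - Pgmin) / 2;
bias  = (Pgmax + Pgmin) / 2;

% Mean path: tanh directly followed by the scaling layer
meanPath = [
    fullyConnectedLayer(numActions, Name="meanFC")
    tanhLayer(Name="tanhMean")
    scalingLayer(Name="ActorScaling", Scale=scale, Bias=bias)
    ];

% Standard deviation path: sigmoid keeps it in ]0, 1[, then cap it.
% sigmaMax is an assumed cap (10% of each action range), not a recommendation.
sigmaMax = 0.1 * (Pgmax - Pgmin);
sdevPath = [
    fullyConnectedLayer(numActions, Name="sdevFC")
    sigmoidLayer(Name="sigmoidStdv")
    scalingLayer(Name="SdevScaling", Scale=sigmaMax)
    ];
```

With these names, the actor would be created with "ActorScaling" as the mean output and "SdevScaling" as the standard deviation output, and the shared input path would connect to "meanFC" and "sdevFC".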

Finally, 1) and 2) together could be effective: 2) keeps the mean action that your agent learns within a meaningful range of values, and 1) mathematically guarantees that the applied action never exceeds the range of acceptable values (which 2) alone cannot do, as PPO is stochastic).
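To see why both measures are useful together, here is a small numeric sketch in plain MATLAB. All values are hypothetical: the mean is assumed to already lie inside the range (as the tanh and scaling layers would ensure), the standard deviation is bounded, and the action is drawn from a Gaussian as a PPO actor does.

```matlab
rng(0);                 % fixed seed so the run is repeatable

lo = -10;               % hypothetical action limits
hi = 10;
mu = 9.5;               % mean inside the range, close to the upper limit
sigma = 1;              % bounded standard deviation

% Sample actions the way a Gaussian policy would
samples = mu + sigma .* randn(1e5, 1);

% Fraction of sampled actions that leave the range despite the bounded mean
fracOutside = mean(samples < lo | samples > hi);
fprintf('Fraction outside the range: %.3f\n', fracOutside);

% Clipping on the environment side restores the guarantee
clipped = min(max(samples, lo), hi);
fprintf('Clipped range: [%.2f, %.2f]\n', min(clipped), max(clipped));
```

A noticeable share of the samples falls above the upper limit even though the mean respects it, while the clipped values never leave the range.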

Some interesting documentation:

Create Simulink Environment and Train Agent - MATLAB & Simulink - MathWorks France

List of Deep Learning Layers - MATLAB & Simulink - MathWorks France

Hope it helps,

Regards,

Nicolas
