How to create a custom Reinforcement Learning Environment + DDPG agent

Hello,
I'm working on an agent for a problem in the spectral domain. I want to damp (attenuate) frequencies in a spectrum in such a way that the resulting spectrum looks like a rect() function.
So I created the following environment with a [3 1] continuous observation and action space as an abstract version of the real problem. The initial observation is a random [3 1] vector with values between 0 and 2. The action space is a [3 1] vector with values between -1 and 1. The reward function is maximal when every vector element equals a target vector such as [1 1 1].
The environment file looks like this:
classdef SimpleContinuousEnv < rl.env.MATLABEnvironment
    %SIMPLECONTINUOUSENV: Template for defining a custom environment in MATLAB.

    %% Properties (set properties' attributes accordingly)
    properties
        % Specify and initialize the environment's necessary properties
        MaxForce = 1
        % Reward for a good correlation between observation and
        % target function
        RewardForGoodShaping = 10
        % Penalty for a bad correlation between observation and
        % target function
        PenaltyForBadShaping = -10
    end
    properties
        % Initialize system state
        State = zeros(1,3)
    end
    properties(Access = protected)
        % Initialize internal flag to indicate episode termination
        IsDone = false
    end

    %% Necessary Methods
    methods
        % Constructor method creates an instance of the environment
        % Change class name and constructor name accordingly
        function this = SimpleContinuousEnv()
            % Initialize observation settings
            ObservationInfo = rlNumericSpec([3 1],'UpperLimit',[5;5;5],'LowerLimit',[-5;-5;-5]);
            ObservationInfo.Name = 'Test SpectraObs';
            ObservationInfo.Description = '1D rndm Number Vector';
            % Initialize action settings
            ActionInfo = rlNumericSpec([3 1],'UpperLimit',[1;1;1],'LowerLimit',[-1;-1;-1]);
            ActionInfo.Name = 'Attenuation Vector';
            ActionInfo.Description = '1D Attenuation Vector';
            % The following line implements built-in functions of the RL env
            this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
            % Initialize property values and pre-compute necessary values
            updateActionInfo(this);
        end

        % Apply system dynamics and simulate the environment with the
        % given action for one step.
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            LoggedSignals = [];
            % Get action
            Force = getForce(this,Action);
            Observation = this.State + Force;
            % Update system states
            this.State = Observation;
            % Check terminal condition
            parse = (Observation==[1 1 1]);
            IsDone = all(ismember(parse,1));
            this.IsDone = IsDone;
            % Get reward
            Reward = getReward(this,parse);
            % (optional) use notifyEnvUpdated to signal that the
            % environment has been updated (e.g. to update visualization)
            notifyEnvUpdated(this);
        end

        % Reset environment to initial state and output initial observation
        function InitialObservation = reset(this)
            InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);
            this.State = InitialObservation;
            % (optional) use notifyEnvUpdated to signal that the
            % environment has been updated (e.g. to update visualization)
            notifyEnvUpdated(this);
        end
    end

    %% Optional Methods (set methods' attributes accordingly)
    methods
        % Helper methods to create the environment
        % The force equals the chosen action
        function force = getForce(this,action)
            force = action;
        end
        % Update the action info
        function updateActionInfo(this)
            % this.ActionInfo.Elements = this.ActionInfo.Elements;
        end
        % Reward function
        function Reward = getReward(this,parse)
            if this.IsDone == 1
                Reward = sum(parse*this.RewardForGoodShaping);
            else
                Reward = sum(~parse*this.PenaltyForBadShaping);
            end
        end
        % (optional) Properties validation through set methods
        function set.State(this,state)
            validateattributes(state,{'numeric'},{'finite','real','vector','numel',3},'','State');
            this.State = double(state);
            notifyEnvUpdated(this);
        end
        function set.RewardForGoodShaping(this,val)
            validateattributes(val,{'numeric'},{'real','finite','scalar'},'','RewardForGoodShaping');
            this.RewardForGoodShaping = val;
        end
        function set.PenaltyForBadShaping(this,val)
            validateattributes(val,{'numeric'},{'real','finite','scalar'},'','PenaltyForBadShaping');
            this.PenaltyForBadShaping = val;
        end
    end

    methods (Access = protected)
        % (optional) update visualization every time the environment is updated
        % (notifyEnvUpdated is called)
        function envUpdatedCallback(this)
            plot(this.State)
            hold off
            XLimMode = 'auto';
            YLimMode = 'auto';
        end
    end
end
This environment validates, but when I start training with the following DNN and training options:
%%
% Create the agent: first build the critic network.
% A DDPG agent approximates the long-term reward given observations and
% actions using a critic (Q-value function) representation. To create the
% critic, first create a deep neural network with two inputs, the state and
% the action, and one output.
statePath = [
    imageInputLayer([3 1],'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticStateFC2')];
actionPath = [
    imageInputLayer([3 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
figure(1)
plot(criticNetwork)
%figure(2)
%hold on
%%
% Specify options for the critic representation using
% rlRepresentationOptions
criticOpts = rlRepresentationOptions('LearnRate',0.01,'GradientThreshold',1,'UseDevice',"gpu");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},criticOpts);
%%
% Create a DDPG agent with a continuous action space.
actorNetwork = [
    imageInputLayer([3 1],'Normalization','none','Name','state')
    fullyConnectedLayer(3,'Name','action','BiasLearnRateFactor',1,'BiasInitializer','zeros','Bias',[0;0;0])];
actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},actorOpts);
% To create the DDPG agent, first specify the DDPG agent options using
% rlDDPGAgentOptions
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',1, ...
    'TargetSmoothFactor',1e-3, ...
    'ExperienceBufferLength',1e6, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-6;
% Then, create the DDPG agent using the specified actor representation,
% critic representation, and agent options.
agent = rlDDPGAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',500, ...
    'MaxStepsPerEpisode',10, ...
    'Verbose',true, ...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',5);
I get a big error in the training process when calculating the cumulative reward. There also seem to be several problems in this code that I can't figure out completely from the examples given in the toolbox.
Error using rl.agent.AbstractPolicy/step (line 116)
Invalid input argument type or size such as observation, reward, isdone or loggedSignals.
Error in rl.env.MATLABEnvironment/simLoop (line 241)
action = step(policy,observation,reward,isdone);
Error in rl.env.MATLABEnvironment/simWithPolicyImpl (line 106)
[expcell{simCount},epinfo,siminfos{simCount}] = simLoop(env,policy,opts,simCount,usePCT);
Error in rl.env.AbstractEnv/simWithPolicy (line 70)
[experiences,varargout{1:(nargout-1)}] = simWithPolicyImpl(this,policy,opts,varargin{:});
Error in rl.task.SeriesTrainTask/runImpl (line 33)
[varargout{1},varargout{2}] = simWithPolicy(this.Env,this.Agent,simOpts);
Error in rl.task.Task/run (line 21)
[varargout{1:nargout}] = runImpl(this);
Error in rl.task.TaskSpec/internal_run (line 159)
[varargout{1:nargout}] = run(task);
Error in rl.task.TaskSpec/runDirect (line 163)
[this.Outputs{1:getNumOutputs(this)}] = internal_run(this);
Error in rl.task.TaskSpec/runScalarTask (line 187)
runDirect(this);
Error in rl.task.TaskSpec/run (line 69)
runScalarTask(task);
Error in rl.train.SeriesTrainer/run (line 24)
run(seriestaskspec);
Error in rl.train.TrainingManager/train (line 291)
run(trainer);
Error in rl.train.TrainingManager/run (line 160)
train(this);
Error in rl.agent.AbstractAgent/train (line 54)
TrainingStatistics = run(trainMgr);
Error in DQN_Agent_for_LaserSpectrum_optimization (line 124)
trainingStats = train(agent,env,trainOpts);
Caused by:
Error using rl.agent.AbstractPolicy/step (line 103)
Error setting property 'CumulativeReward' of class 'rl.util.EpisodeInfo'. Value must be a scalar.
So I have many questions. First, how do I fix this error?
Second, is such an agent even possible if my system does its work in one time step, i.e. getting the observation vector and then reaching maximum reward by choosing one action vector?
If yes, which kind of agent might be best?
Third, how can I scale up the example to a much bigger vector, and how can I shrink the action space, maybe with constraints?
Fourth, how can I define one action which is a 1D vector and then define the range each element of the vector can have? In my case I would have an n x 1 vector and each element can be in the range 0-35.
Fifth, if my model is fine in simulation, how can I create an agent which works with a "real" hardware environment? Do I have to write the environment in a way that it controls the hardware? Is there an example of how this may work?
Thanks in advance for the answers, and sorry for the very long question.
Best regards,
Kai
  2 comments
Kai Tybussek on 2 Jul 2020
And how do I define updateActionInfo within the environment when the action space is continuous?
Fangyuan Chang on 31 Oct 2020
I have the same issue with the definition of updateActionInfo... have you figured it out? Thanks!


Accepted Answer

Emmanouil Tzorakoleftherakis on 2 Jul 2020
Hi Kai,
What the very first error is telling you is that there is an issue with the dimensions of either your observation, reward, isdone, or loggedSignals. In fact, if you check the lines
% Check terminal condition
parse=(Observation==[1 1 1]);
IsDone=all(ismember(parse,1));
this.IsDone=IsDone;
in your environment, you will see that you are assigning a vector to IsDone, but IsDone is supposed to be a scalar. I changed it to a scalar and training started properly (I cannot comment on the other hyperparameters of the problem).
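For reference, here is a minimal sketch of one way to make the terminal flag a scalar (an assumption about the fix, not necessarily the exact change that was made; it compares against a column target so the logical result stays [3 1]):
% Check terminal condition with a scalar IsDone
target = [1;1;1];                  % column target, same orientation as Observation
parse  = (Observation == target);  % [3 1] logical vector
IsDone = all(parse);               % scalar logical: true only if all elements match
this.IsDone = IsDone;
With this change, getReward(this,parse) also receives a [3 1] logical vector, so sum(parse*this.RewardForGoodShaping) returns a scalar reward as well, which matches what EpisodeInfo.CumulativeReward expects.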
Some more answers to your questions:
Second, is such an agent even possible if my system does its work in one time step, i.e. getting the observation vector and then reaching maximum reward by choosing one action vector? If yes, which kind of agent might be best?
I am not sure what you mean here. The time step and the agent sample time are determined case by case, depending on the problem you are working with.
Third, how can I scale up the example to a much bigger vector, and how can I shrink the action space, maybe with constraints?
Reinforcement learning does not typically consider hard constraints in the problem formulation, so if you have constraints in your problem, you would probably need to treat them as soft constraints and add penalties to your reward signal when they are violated.
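As a rough sketch of such a soft constraint inside the environment's reward computation (the names actionLimit and penaltyWeight are illustrative, not part of the toolbox):
% Soft constraint sketch: penalize action elements outside an allowed band.
actionLimit   = 1;                                 % illustrative bound on each element
penaltyWeight = 5;                                 % illustrative penalty weight
violation = max(abs(Action) - actionLimit, 0);     % zero when the action is inside the band
Reward    = Reward - penaltyWeight*sum(violation); % subtract a penalty proportional to the violation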
Fourth, how can I define one action which is a 1D vector and then define the range each element of the vector can have? In my case I would have an n x 1 vector and each element can be in the range 0-35.
actInfo = rlNumericSpec([n 1],'LowerLimit',0,'UpperLimit',35); % for continuous action spaces
See this example if your inputs are discrete.
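For a discrete action space, a sketch of the analogous spec (assuming you can enumerate the allowed n x 1 action vectors; the specific vectors below are just placeholders) would use rlFiniteSetSpec with a cell array of candidate actions:
% Sketch: discrete action space listing a few candidate [3 1] action vectors.
actInfo = rlFiniteSetSpec({[0;0;0],[10;10;10],[35;35;35]});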
Fifth, if my model is fine in simulation, how can I create an agent which works with a "real" hardware environment? Do I have to write the environment in a way that it controls the hardware? Is there an example of how this may work?
There is no out-of-the-box functionality for this yet, so you would have to implement the communication part yourself (but we are actively working on this).
And how do I define updateActionInfo within the environment when the action space is continuous?
updateActionInfo is only called once, in the constructor of the environment class, so it is certainly not necessary to implement it if you set up the action space otherwise (see the sketch below).
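A minimal sketch (just restating what the constructor-based setup above already does): since the continuous action space is fully described by the rlNumericSpec passed to the base-class constructor, the helper can simply be left empty:
% updateActionInfo can stay empty for a fixed continuous action space;
% ActionInfo is already defined via rlNumericSpec in the constructor.
function updateActionInfo(this)
end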
Hope this helps
  9 comments
Kai Tybussek on 8 Jul 2020
I updated to Update 3 but still get an error when I train on the GPU, and no error on the CPU.
"InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);"
I think this may cause trouble in the GPU calculation. When I set breakpoints, one gpuArray is single and the other array is double.
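If the single/double mismatch really is the cause (an assumption, not verified), one experiment would be to make the data types consistent, e.g. by returning the initial observation in single precision and checking the classes at the breakpoints:
% Untested sketch: cast the initial observation to single so it matches
% single-precision gpuArray data (and check classes where it is consumed).
InitialObservation = single([randi([1 2]);randi([1 2]);randi([1 2])]);
disp(class(InitialObservation))   % 'single'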
Emmanouil Tzorakoleftherakis on 8 Jul 2020
I think it would be better to contact technical support at this point and provide the exact reproduction model and the error you are seeing. They would be able to get in touch with the development team if necessary.


More Answers (0)
