How to create a custom Reinforcement Learning Environment + DDPG agent

Hello,
I'm working on an agent for a problem in the spectral domain. I want to damp (attenuate) frequencies in a spectrum in such a way that the resulting spectrum looks like a rect() function.
So I created the following environment with a [3 1] continuous observation and action space as an abstract version of the real problem. The initial observation is a random [3 1] vector with values between 0 and 2. The action space is a [3 1] vector with values between -1 and 1. The reward function is maximal when every vector element equals a target vector such as [1 1 1].
The environment file looks like this:
classdef SimpleContinuousEnv < rl.env.MATLABEnvironment
    %SIMPLECONTINUOUSENV: Template for defining a custom environment in MATLAB.

    %% Properties (set properties' attributes accordingly)
    properties
        % Specify and initialize the environment's necessary properties
        MaxForce = 1
        % Reward for a good correlation between observation and
        % target function
        RewardForGoodShaping = 10
        % Penalty for a bad correlation between observation and
        % target function
        PenaltyForBadShaping = -10
    end
    properties
        % Initialize system state
        State = zeros(1,3)
    end
    properties(Access = protected)
        % Initialize internal flag to indicate episode termination
        IsDone = false
    end

    %% Necessary Methods
    methods
        % Constructor method creates an instance of the environment
        % Change class name and constructor name accordingly
        function this = SimpleContinuousEnv()
            % Initialize observation settings
            ObservationInfo = rlNumericSpec([3 1],'UpperLimit',[5;5;5],'LowerLimit',[-5;-5;-5]);
            ObservationInfo.Name = 'Test SpectraObs';
            ObservationInfo.Description = '1D rndm Number Vector';
            % Initialize action settings
            ActionInfo = rlNumericSpec([3 1],'UpperLimit',[1;1;1],'LowerLimit',[-1;-1;-1]);
            ActionInfo.Name = 'Attenuation Vector';
            ActionInfo.Description = '1D Attenuation Vector';
            % The following line implements built-in functions of the RL env
            this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
            % Initialize property values and pre-compute necessary values
            updateActionInfo(this);
        end

        % Apply system dynamics and simulate the environment with the
        % given action for one step.
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            LoggedSignals = [];
            % Get action
            Force = getForce(this,Action);
            Observation = this.State + Force;
            % Update system states
            this.State = Observation;
            % Check terminal condition
            parse = (Observation==[1 1 1]);
            IsDone = all(ismember(parse,1));
            this.IsDone = IsDone;
            % Get reward
            Reward = getReward(this,parse);
            % (optional) use notifyEnvUpdated to signal that the
            % environment has been updated (e.g. to update visualization)
            notifyEnvUpdated(this);
        end

        % Reset environment to initial state and output initial observation
        function InitialObservation = reset(this)
            InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);
            this.State = InitialObservation;
            % (optional) use notifyEnvUpdated to signal that the
            % environment has been updated (e.g. to update visualization)
            notifyEnvUpdated(this);
        end
    end

    %% Optional Methods (set methods' attributes accordingly)
    methods
        % Helper methods to create the environment
        % The force equals the chosen action
        function force = getForce(this,action)
            force = action;
        end
        % Update the action info
        function updateActionInfo(this)
            % this.ActionInfo.Elements = this.ActionInfo.Elements;
        end
        % Reward function
        function Reward = getReward(this,parse)
            if this.IsDone == 1
                Reward = sum(parse*this.RewardForGoodShaping);
            else
                Reward = sum(~parse*this.PenaltyForBadShaping);
            end
        end
        % (optional) Properties validation through set methods
        function set.State(this,state)
            validateattributes(state,{'numeric'},{'finite','real','vector','numel',3},'','State');
            this.State = double(state);
            notifyEnvUpdated(this);
        end
        function set.RewardForGoodShaping(this,val)
            validateattributes(val,{'numeric'},{'real','finite','scalar'},'','RewardForGoodShaping');
            this.RewardForGoodShaping = val;
        end
        function set.PenaltyForBadShaping(this,val)
            validateattributes(val,{'numeric'},{'real','finite','scalar'},'','PenaltyForBadShaping');
            this.PenaltyForBadShaping = val;
        end
    end

    methods (Access = protected)
        % (optional) update visualization every time the environment is updated
        % (notifyEnvUpdated is called)
        function envUpdatedCallback(this)
            plot(this.State)
            hold off
            XLimMode = 'auto';
            YLimMode = 'auto';
        end
    end
end
This environment validates, but when I start training with the following DNN and training options:
%%
% Create the agent: first build the critic network.
% A DDPG agent approximates the long-term reward given observations and
% actions using a critic (Q-value function) representation. To create the
% critic, first create a deep neural network with two inputs, the state and
% the action, and one output.
statePath = [
    imageInputLayer([3 1],'Normalization','none','Name','state')
    fullyConnectedLayer(24,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(24,'Name','CriticStateFC2')];
actionPath = [
    imageInputLayer([3 1],'Normalization','none','Name','action')
    fullyConnectedLayer(24,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
figure(1)
plot(criticNetwork)
%figure(2)
%hold on
%%
% Specify options for the critic representation using
% rlRepresentationOptions
criticOpts = rlRepresentationOptions('LearnRate',0.01,'GradientThreshold',1,'UseDevice',"gpu");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},criticOpts);
%%
% Create a DDPG agent with a continuous action space.
actorNetwork = [
    imageInputLayer([3 1],'Normalization','none','Name','state')
    fullyConnectedLayer(3,'Name','action','BiasLearnRateFactor',1,'BiasInitializer','zeros','Bias',[0;0;0])];
actorOpts = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'state'},'Action',{'action'},actorOpts);
% To create the DDPG agent, first specify the DDPG agent options using
% rlDDPGAgentOptions
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',1, ...
    'TargetSmoothFactor',1e-3, ...
    'ExperienceBufferLength',1e6, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentOpts.NoiseOptions.Variance = 0.3;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-6;
% Then, create the DDPG agent using the specified actor representation,
% critic representation, and agent options.
agent = rlDDPGAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',500, ...
    'MaxStepsPerEpisode',10, ...
    'Verbose',true, ...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeCount',...
    'StopTrainingValue',5);
I get a big error in the training process when calculating the cumulative reward. There also seem to be several problems in this code that I can't figure out completely from the examples given in the toolbox.
Error using rl.agent.AbstractPolicy/step (line 116)
Invalid input argument type or size such as observation, reward, isdone or loggedSignals.
Error in rl.env.MATLABEnvironment/simLoop (line 241)
action = step(policy,observation,reward,isdone);
Error in rl.env.MATLABEnvironment/simWithPolicyImpl (line 106)
[expcell{simCount},epinfo,siminfos{simCount}] = simLoop(env,policy,opts,simCount,usePCT);
Error in rl.env.AbstractEnv/simWithPolicy (line 70)
[experiences,varargout{1:(nargout-1)}] = simWithPolicyImpl(this,policy,opts,varargin{:});
Error in rl.task.SeriesTrainTask/runImpl (line 33)
[varargout{1},varargout{2}] = simWithPolicy(this.Env,this.Agent,simOpts);
Error in rl.task.Task/run (line 21)
[varargout{1:nargout}] = runImpl(this);
Error in rl.task.TaskSpec/internal_run (line 159)
[varargout{1:nargout}] = run(task);
Error in rl.task.TaskSpec/runDirect (line 163)
[this.Outputs{1:getNumOutputs(this)}] = internal_run(this);
Error in rl.task.TaskSpec/runScalarTask (line 187)
runDirect(this);
Error in rl.task.TaskSpec/run (line 69)
runScalarTask(task);
Error in rl.train.SeriesTrainer/run (line 24)
run(seriestaskspec);
Error in rl.train.TrainingManager/train (line 291)
run(trainer);
Error in rl.train.TrainingManager/run (line 160)
train(this);
Error in rl.agent.AbstractAgent/train (line 54)
TrainingStatistics = run(trainMgr);
Error in DQN_Agent_for_LaserSpectrum_optimization (line 124)
trainingStats = train(agent,env,trainOpts);
Caused by:
Error using rl.agent.AbstractPolicy/step (line 103)
Error setting property 'CumulativeReward' of class 'rl.util.EpisodeInfo'. Value must be a scalar.
So I have many questions. First, how do I fix this error?
Second, is such an agent even possible if my system does its work in one time step, i.e. getting the observation vector and then reaching maximum reward by choosing one action vector?
If yes, which kind of agent might be best?
Third, how can I scale up the example to a much bigger vector, and how can I shrink the action space, maybe with constraints?
Fourth, how can I define one action which is a 1D vector and then define the range each element of the vector can have? In my case I would have an n x 1 vector and each element can be in the range 0-35.
Fifth, if my model is fine in simulation, how can I create an agent which works with a "real" hardware environment? Do I have to write the environment in a way that it controls the hardware? Is there an example of how this may work?
Thanks in advance for the answers, and sorry for the very long question.
Best regards,
Kai
  2 comments
Kai Tybussek on 2 Jul 2020
And how do I define updateActionInfo within the environment when the action space is continuous?
Fangyuan Chang on 31 Oct 2020
I have the same issue with the definition of updateActionInfo... have you figured it out? Thanks!


Accepted Answer

Emmanouil Tzorakoleftherakis on 2 Jul 2020
Hi Kai,
What the very first error is telling you is that there is an issue with the dimensions of either your observation, reward, isdone, or loggedSignals. In fact, if you check the lines
% Check terminal condition
parse=(Observation==[1 1 1]);
IsDone=all(ismember(parse,1));
this.IsDone=IsDone;
in your environment, you will see that you are assigning a vector to IsDone, but IsDone is supposed to be a scalar. I changed it to a scalar and training started properly (I cannot comment on the other hyperparameters of the problem).
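For reference, here is a minimal sketch of one way to make the terminal flag a scalar (an assumption about the fix, not necessarily the exact change that was made; it compares against a column target so the logical result stays [3 1]):
% Check terminal condition with a scalar IsDone
target = [1;1;1];                  % column target, same orientation as Observation
parse  = (Observation == target);  % [3 1] logical vector
IsDone = all(parse);               % scalar logical: true only if all elements match
this.IsDone = IsDone;
With this change, getReward(this,parse) also receives a [3 1] logical vector, so sum(parse*this.RewardForGoodShaping) returns a scalar reward as well, which matches what EpisodeInfo.CumulativeReward expects.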
Some more answers to your questions:
Second, is such an agent even possible if my system does its work in one time step, i.e. getting the observation vector and then reaching maximum reward by choosing one action vector? If yes, which kind of agent might be best?
I am not sure what you mean here. The time step and the agent sample time are determined case by case, depending on the problem you are working with.
Third, how can I scale up the example to a much bigger vector, and how can I shrink the action space, maybe with constraints?
Reinforcement learning does not typically consider hard constraints in the problem formulation, so if you have constraints in your problem, you would probably need to treat them as soft constraints and add penalties to your reward signal when they are violated.
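As a rough sketch of such a soft constraint inside the environment's reward computation (the names actionLimit and penaltyWeight are illustrative, not part of the toolbox):
% Soft constraint sketch: penalize action elements outside an allowed band.
actionLimit   = 1;                                 % illustrative bound on each element
penaltyWeight = 5;                                 % illustrative penalty weight
violation = max(abs(Action) - actionLimit, 0);     % zero when the action is inside the band
Reward    = Reward - penaltyWeight*sum(violation); % subtract a penalty proportional to the violation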
Fourth, how can I define one action which is a 1D vector and then define the range each element of the vector can have? In my case I would have an n x 1 vector and each element can be in the range 0-35.
actInfo = rlNumericSpec([n 1],'LowerLimit',0,'UpperLimit',35); % for continuous action spaces
See this example if your inputs are discrete.
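For a discrete action space, a sketch of the analogous spec (assuming you can enumerate the allowed n x 1 action vectors; the specific vectors below are just placeholders) would use rlFiniteSetSpec with a cell array of candidate actions:
% Sketch: discrete action space listing a few candidate [3 1] action vectors.
actInfo = rlFiniteSetSpec({[0;0;0],[10;10;10],[35;35;35]});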
Fifth, if my model is fine in simulation, how can I create an agent which works with a "real" hardware environment? Do I have to write the environment in a way that it controls the hardware? Is there an example of how this may work?
There is no out-of-the-box functionality for this yet, so you would have to implement the communication part yourself (but we are actively working on this).
And how do I define updateActionInfo within the environment when the action space is continuous?
updateActionInfo is only called once, in the constructor of the environment class, so it is certainly not necessary to implement it if you set up the action space otherwise (see the sketch below).
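A minimal sketch (just restating what the constructor-based setup above already does): since the continuous action space is fully described by the rlNumericSpec passed to the base-class constructor, the helper can simply be left empty:
% updateActionInfo can stay empty for a fixed continuous action space;
% ActionInfo is already defined via rlNumericSpec in the constructor.
function updateActionInfo(this)
end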
Hope this helps
  9 comments
Kai Tybussek on 8 Jul 2020
I updated to Update 3 but still get an error when I train on the GPU, and no error on the CPU.
"InitialObservation = double([randi([1 2]);randi([1 2]);randi([1 2])]);"
I think this may cause trouble in the GPU calculation. When I set breakpoints, one gpuArray is single and the other array is double.
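If the single/double mismatch really is the cause (an assumption, not verified), one experiment would be to make the data types consistent, e.g. by returning the initial observation in single precision and checking the classes at the breakpoints:
% Untested sketch: cast the initial observation to single so it matches
% single-precision gpuArray data (and check classes where it is consumed).
InitialObservation = single([randi([1 2]);randi([1 2]);randi([1 2])]);
disp(class(InitialObservation))   % 'single'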
Emmanouil Tzorakoleftherakis on 8 Jul 2020
I think it would be better to contact technical support at this point and provide the exact reproduction model and the error you are seeing. They would be able to get in touch with the development team if necessary.


More Answers (0)
