Parallel optimization hanging on getCompleteIntervals

39 views (last 30 days)
Samuel Nathan on 30 Mar 2020
Commented: 宇龙 on 19 Nov 2022
I'm using a Cloud Center cluster with parpool, and the optimization runs for a while before suddenly hanging. The code does not always hang, but it does about 9 times out of 10. I suspected a deadlock, but I have made sure each worker has the files it requires. After it hangs I can exit with Ctrl-C, but I have to restart the server to get the optimization running again; otherwise it hangs waiting for the pool to be ready.
Init code:
c = parpool('AttachedFiles',{'OptimiseModel.m','decreasing_amplitude_01.mat','ArmModelV2.slx','MapData.m','sim_model_test.m','slprj'});
mpiSettings('DeadlockDetection','on')
mpiSettings('MessageLogging','on')
mpiSettings('MessageLoggingDestination','CommandWindow')
My objective function OptimiseModel runs a Simulink model with values passed in from the particle swarm algorithm:
if init == true
simIn = MapData;
init = false;
end
simOut = sim(simIn);
RMSE = simOut.get('rmse');
Each worker has its own copy of simIn, and the init logic is a hack to allow the function to be evaluated by the client instance, which happens once at the beginning of the particleswarm algorithm. (I don't know why.)
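For illustration, here is a minimal sketch of one way this per-worker caching could be written with a persistent variable (placeholder signature and names - my actual function takes many more arguments; this assumes MapData() returns a Simulink.SimulationInput):
function RMSE = OptimiseModel(gains)
    persistent simIn                              % one cached copy per worker process
    if isempty(simIn)
        simIn = MapData();                        % build the simulation input only once
    end
    simIn = simIn.setVariable('gains', gains);    % 'gains' is a hypothetical tunable variable
    simOut = sim(simIn);
    RMSE = simOut.get('rmse');
end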
% Configure the model and build the simulation input on every worker
spmd
model = load_system('ArmModelV2');
set_param(model, 'SimulationCommand', 'stop')
set_param(model,'FastRestart','on');
set_param(model,'SimulationMode','Accelerator');
set_param(model,'AccelVerboseBuild','on')
simIn = MapData();
end
~~~~~~~~~~~~
fun = @(x)OptimiseModel(init,MCV_B,x(1),x(2),x(3),x(4),x(5),VMO_B,x(6),x(7),x(8),MCV_T,x(9),x(10),x(11), ...
x(12),x(13),VMO_T,x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23),x(24),x(25),x(26),x(27));
options = optimoptions('particleswarm','UseParallel',true,'UseVectorized',false,'PlotFcn','pswplotbestf');
[x,rmse_best] = particleswarm(fun,27,lb,ub,options);
Everything looks good until, out of nowhere, the workers stop running the objective function and the code hangs here, which is part of the source for remoteparfor:
while isempty(r)
assert(obj.NumIntervalsInController > 0, ...
'Internal error in PARFOR - no intervals to retrieve.');
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
WHY? Can anybody help me? I can provide more of the code if required (I didn't include it all, as most of it is irrelevant - at least I thought so). Any suggestions on further debugging strategies would also be great.
Thanks a lot!
EDIT: The code works in serial.
1 comment
Samuel Nathan on 1 Apr 2020
Further investigation shows that a number of workers are crashing, even after modifying the particleswarm parfor according to the instructions at https://uk.mathworks.com/help/simulink/ug/not-recommended-using-sim-function-within-parfor.html#brsk7nj. I am now looking at a way to restart workers / cancel and restart jobs on workers.
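In the meantime, here is a minimal sketch of how to tear down a possibly wedged pool and start a fresh one before re-running (standard Parallel Computing Toolbox calls; the attached-files list is abbreviated here):
pool = gcp('nocreate');    % get the current pool without creating a new one
if ~isempty(pool)
    delete(pool);          % shut down the existing workers
end
parpool('AttachedFiles', {'OptimiseModel.m', 'ArmModelV2.slx', 'MapData.m'});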


Answers (1)

Edric Ellis on 31 Mar 2020
A few notes:
  1. The deadlock detection is for labSend and labReceive. Your parallel code is using parfor. There is no way that parfor can encounter a cyclic deadlock because the workers operating on the body of the loop do not communicate with each other (except possibly via the file system). (When writing labSend and labReceive code inside spmd, you can write a cyclic deadlock, and that's what the deadlock detection setting can help you discover).
  2. Your mpiSettings calls should be run on the workers - i.e. inside an spmd block; see the short sketch after this list. (But see point (1) - I don't think they're relevant here.)
  3. The method getCompleteIntervals is a completely normal part of parfor operation - this is where the client waits for the workers to return their results. The only thing that you can deduce from the client waiting at that point is that the workers haven't finished their parfor loop iterations yet.
  4. I am suspicious of your use of accelerated simulation mode. I'm not an expert, but I think that this might possibly cause the workers to interfere with one another via the filesystem.
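For note 2, a minimal sketch of moving those calls onto the workers (probably irrelevant to this particular hang, per note 1):
spmd
    mpiSettings('DeadlockDetection', 'on');
    mpiSettings('MessageLogging', 'on');
    mpiSettings('MessageLoggingDestination', 'CommandWindow');
end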
Here's what I would try first: run with a parallel pool of size 1. If that fixes things, then perhaps the workers are interfering with one another via the file system.
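For example (the attached-files list here is abbreviated from your original parpool call):
delete(gcp('nocreate'));    % close any existing pool first
% one-worker pool to rule out workers interfering via the file system
parpool(1, 'AttachedFiles', {'OptimiseModel.m', 'ArmModelV2.slx', 'MapData.m'});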
You could force the workers to temporarily change to a unique directory prior to running the simulations by doing something like this:
% force the workers into a unique directory
spmd
myTempDir = tempname(); % tempname returns a globally unique name
oldWd = pwd();
mkdir(myTempDir);
cd(myTempDir);
end
% ... run stuff in parfor
particleswarm();
% Put the workers back into the original working directory
spmd
cd(oldWd);
end
But that's a complete stab in the dark without having reproduction steps that I can try out.
5 comments
Jinsu Kim on 31 May 2021
I also encountered the same problem. A GA optimization with parallel computing (using parfor) got stuck at the lines below:
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
Is your problem resolved now?
宇龙 on 19 Nov 2022
I think you may try to use fewer logical processors.
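For example, to cap the pool at a smaller number of workers (the size here is just an illustration):
parpool(4);    % use fewer workers than the number of logical processors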
