Parallel optimization hanging on getCompleteIntervals

39 views (last 30 days)
Samuel Nathan on 30 Mar 2020
Commented: 宇龙 on 19 Nov 2022
I'm using a Cloud Center cluster with parpool, and the optimization runs for a while before suddenly hanging. The code does not always hang, but it does about 9 times out of 10. I suspected a deadlock, but I have made sure each worker has the files it requires. After it hangs I can exit with Ctrl-C, but I have to restart the server to get the optimization running again; otherwise it hangs waiting for the pool to be ready.
Init code:
c = parpool('AttachedFiles',{'OptimiseModel.m','decreasing_amplitude_01.mat','ArmModelV2.slx','MapData.m','sim_model_test.m','slprj'});
mpiSettings('DeadlockDetection','on')
mpiSettings('MessageLogging','on')
mpiSettings('MessageLoggingDestination','CommandWindow')
My objective function OptimiseModel runs a Simulink model with values passed in from the particle swarm algorithm:
if init == true
simIn = MapData;
init = false;
end
simOut = sim(simIn);
RMSE = simOut.get('rmse');
Each worker has its own copy of simIn, and the init logic is a hack to allow the function to be evaluated by the client instance, which happens once at the beginning of the particleswarm algorithm. (I don't know why.)
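For illustration, here is a minimal sketch of one way this per-worker caching could be written with a persistent variable (placeholder signature and names - my actual function takes many more arguments; this assumes MapData() returns a Simulink.SimulationInput):
function RMSE = OptimiseModel(gains)
    persistent simIn                              % one cached copy per worker process
    if isempty(simIn)
        simIn = MapData();                        % build the simulation input only once
    end
    simIn = simIn.setVariable('gains', gains);    % 'gains' is a hypothetical tunable variable
    simOut = sim(simIn);
    RMSE = simOut.get('rmse');
end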
% Configure the model and build the simulation input on every worker
spmd
model = load_system('ArmModelV2');
set_param(model, 'SimulationCommand', 'stop')
set_param(model,'FastRestart','on');
set_param(model,'SimulationMode','Accelerator');
set_param(model,'AccelVerboseBuild','on')
simIn = MapData();
end
~~~~~~~~~~~~
fun = @(x)OptimiseModel(init,MCV_B,x(1),x(2),x(3),x(4),x(5),VMO_B,x(6),x(7),x(8),MCV_T,x(9),x(10),x(11), ...
x(12),x(13),VMO_T,x(14),x(15),x(16),x(17),x(18),x(19),x(20),x(21),x(22),x(23),x(24),x(25),x(26),x(27));
options = optimoptions('particleswarm','UseParallel',true,'UseVectorized',false,'PlotFcn','pswplotbestf');
[x,rmse_best] = particleswarm(fun,27,lb,ub,options);
Everything looks good until, out of nowhere, the workers stop running the objective function and the code hangs here, which is part of the source for remoteparfor:
while isempty(r)
assert(obj.NumIntervalsInController > 0, ...
'Internal error in PARFOR - no intervals to retrieve.');
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
WHY? Can anybody help me? I can provide more of the code if required (I didn't include it all, as most of it is irrelevant - at least I thought so). Any suggestions on further debugging strategies would also be great.
Thanks a lot!
EDIT: The code works in serial.
1 comment
Samuel Nathan on 1 Apr 2020
Further investigation shows that a number of workers are crashing, even after modifying the particleswarm parfor according to the instructions at https://uk.mathworks.com/help/simulink/ug/not-recommended-using-sim-function-within-parfor.html#brsk7nj. I am now looking at a way to restart workers / cancel and restart jobs on workers.
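In the meantime, here is a minimal sketch of how to tear down a possibly wedged pool and start a fresh one before re-running (standard Parallel Computing Toolbox calls; the attached-files list is abbreviated here):
pool = gcp('nocreate');    % get the current pool without creating a new one
if ~isempty(pool)
    delete(pool);          % shut down the existing workers
end
parpool('AttachedFiles', {'OptimiseModel.m', 'ArmModelV2.slx', 'MapData.m'});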


Answers (1)

Edric Ellis on 31 Mar 2020
A few notes:
  1. The deadlock detection is for labSend and labReceive. Your parallel code is using parfor. There is no way that parfor can encounter a cyclic deadlock because the workers operating on the body of the loop do not communicate with each other (except possibly via the file system). (When writing labSend and labReceive code inside spmd, you can write a cyclic deadlock, and that's what the deadlock detection setting can help you discover).
  2. Your mpiSettings calls should be run on the workers - i.e. inside an spmd block; see the short sketch after this list. (But see point (1) - I don't think they're relevant here.)
  3. The method getCompleteIntervals is a completely normal part of parfor operation - this is where the client waits for the workers to return their results. The only thing that you can deduce from the client waiting at that point is that the workers haven't finished their parfor loop iterations yet.
  4. I am suspicious of your use of accelerated simulation mode. I'm not an expert, but I think that this might possibly cause the workers to interfere with one another via the filesystem.
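For note 2, a minimal sketch of moving those calls onto the workers (probably irrelevant to this particular hang, per note 1):
spmd
    mpiSettings('DeadlockDetection', 'on');
    mpiSettings('MessageLogging', 'on');
    mpiSettings('MessageLoggingDestination', 'CommandWindow');
end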
Here's what I would try first: run with a parallel pool of size 1. If that fixes things, then perhaps the workers are interfering with one another via the file system.
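For example (the attached-files list here is abbreviated from your original parpool call):
delete(gcp('nocreate'));    % close any existing pool first
% one-worker pool to rule out workers interfering via the file system
parpool(1, 'AttachedFiles', {'OptimiseModel.m', 'ArmModelV2.slx', 'MapData.m'});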
You could force the workers to temporarily change to a unique directory prior to running the simulations by doing something like this:
% force the workers into a unique directory
spmd
myTempDir = tempname(); % tempname returns a globally unique name
oldWd = pwd();
mkdir(myTempDir);
cd(myTempDir);
end
% ... run stuff in parfor
particleswarm();
% Put the workers back into the original working directory
spmd
cd(oldWd);
end
But that's a complete stab in the dark without having reproduction steps that I can try out.
5 comments
Jinsu Kim on 31 May 2021
I also encountered the same problem. A GA optimization with parallel computing (using parfor) got stuck at the lines below:
r = q.poll(1, timeUnitSeconds);
obj.displayOutput();
Is your problem resolved now?
宇龙 on 19 Nov 2022
I think you may try to use fewer logical processors.
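For example, to cap the pool at a smaller number of workers (the size here is just an illustration):
parpool(4);    % use fewer workers than the number of logical processors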
