Specify the parallel pool job timeout

Question

Maria el 1 de Oct. de 2021

0
Enlazar

Enlace directo a esta pregunta

https://la.mathworks.com/matlabcentral/answers/1464799-specify-the-parallel-pool-job-timeout

Comentada: Raymond Norris el 2 de Oct. de 2021

Hi,

I am running some tests using a remote cluster (no local cluster). I send my functions in batch mode. I know that the functions take a large execution time, around 2 / 4 hours. When I try to run, I get the message:

'The parallel pool job was cancelled because it failed to finish within the specified parallel pool job timout of 300 seconds'

I looked into the documentation how to change the default timeout time. The only way I could find is with the "wait" command, as

wait(job,"finished",18000);

However, I keep on getting the same error. How can I change the default parallel pool job timeout in the remote cluster?

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Raymond Norris el 1 de Oct. de 2021

0
Enlazar

Enlace directo a esta respuesta

https://la.mathworks.com/matlabcentral/answers/1464799-specify-the-parallel-pool-job-timeout#answer_799694

Abrir en MATLAB Online

So you doing something like the following

cluster = parcluster;
job = cluster.batch(@mycode,...., 'Pool',size);

Then what you're suggesting is that let's say your code looked something like

function mycode
pause(10 * 60)
parfor idx = 1:N
    ...
end

On the cluster, the workers have a default timeout of 5 minutes and therefore errors out because you're running code (the pause) for 10 minutes before the workers are being used in the parfor.

'The parallel pool job was cancelled because it failed to finish within the specified parallel pool job timout of 300 seconds'

I tried quickly reproducing this with the local scheduler, but didn't (it shouldn't matter if I'm jusing local). How are you getting the error message? And which scheduler are you using MJS or generic (e.g. PBS)?

9 comentarios
Mostrar 7 comentarios más antiguosOcultar 7 comentarios más antiguos

Maria el 1 de Oct. de 2021

Editada: Maria el 1 de Oct. de 2021

Abrir en MATLAB Online

So, here is how my code looks like:

delete(gcp('nocreate'));
parallel.defaultClusterProfile('MatlabCluster')
c = parcluster();
N = 32;
job = batch(c,@compute_H_matrix,1,{large data inputs},'Pool',N-1);

The function "compute_H_matrix" has some if /while conditions, then it calls another funciton, "internal_compute_H". In the internal_compute_H, there are 2 for loops, and the outest one is a parfor. In these loops, I fill in the elements of the matrix H, with a call to another function (I know, it is an involved code), "compute_integrals_H", which finally calls a mex file "compute_Ihp_mex" (the mex is created in Linux, so compatible with the cluster). The reason for all the sub-calls is because the geometry of the problem is handled in each function based on some criteria identified from the input.

And after 5 minutes the job "fails" and, from the getReport(job.Tasks(1).Error), I get this error message. The scheduler is MJS

Maria el 1 de Oct. de 2021

Abrir en MATLAB Online

Aha, this is good to know!

But now I took away the distributed and I get

MATLAB worker exited with status 9 during task execution.

What does that mean?

Raymond Norris el 2 de Oct. de 2021

One of the workers crashed, possibly because of an out of memory issue. Email Technical Support (support@mathworks.com) and they can walk you through debug steps and getting the MJS log files to troubleshoot.

Iniciar sesión para comentar.

Specify the parallel pool job timeout

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

9 comentarios
Mostrar 7 comentarios más antiguosOcultar 7 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

Specify the parallel pool job timeout

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

9 comentarios Mostrar 7 comentarios más antiguosOcultar 7 comentarios más antiguos

Ver también

Categorías

Etiquetas

Productos

Versión

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

9 comentarios
Mostrar 7 comentarios más antiguosOcultar 7 comentarios más antiguos