Problem allocating 24 workers using parpool

Geeta Garg on 18 Sep 2020
Edited: Geeta Garg on 22 Sep 2020
I am trying to run several jobs in parallel on a computing cluster. Each node offers at most 24 workers. When I submit the jobs, some are allocated 24 workers, some 12, and some run on only one worker. What could cause this, given that every job uses the same job script and requests 24 workers per node? The output of a job running on one worker says:
>> >> >> Starting parallel pool (parpool) using the 'local' profile ...
>> >> >> >> >> Starting parallel pool (parpool) using the 'local' profile ...
This shows that parpool did start for this job, but only one worker was allocated. The output of a job running on 24 workers says:
>> >> >> Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 24).
I want every job to use all 24 workers, but I have not been able to achieve that. Any help would be greatly appreciated.
  3 comments
Geeta Garg on 18 Sep 2020
Edited: Geeta Garg on 18 Sep 2020
Hi Raymond,
Thank you so much for your reply.
  • Are each of the MATLAB jobs running the same exact code?
Yes, each job runs the same code; only the parameter inputs differ.
  • How are parallel pools starting up? Is parpool being called explicitly and if so with what size?
I use these commands to start the parpool:
parallel.defaultClusterProfile('local')
pc = parcluster('local');
parpool(pc, 24)
I explicitly request 24 workers.
  • If parpool is not passed an argument, the default size is min(12,24) ==> 12 workers
That was my understanding as well. Some of my jobs run on 12 (default) workers, but others run on only 1 worker. I infer that the count is 1 rather than 12 because those jobs take much longer to produce output.
Here is my slurm job script:
#!/bin/bash
#SBATCH -J REG_c_10_15
#SBATCH -p general
#SBATCH -o /N/slate/gegarg/REG_c_10_15.%j.txt
#SBATCH -e /N/slate/gegarg/REG_c_10_15.%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-node=24
#SBATCH --time=60:00:00
cd ~/RTW/MEI_state1_dep_c_o
module load matlab
matlab < script_m2.m
Thank you for your help.
Geeta
Geeta Garg on 22 Sep 2020
Edited: Geeta Garg on 22 Sep 2020
Hi, I just checked the error files for my jobs, and I keep getting this error:
MATLAB numerical calculation framework version 2020a loaded.
Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 670)
Failed to locate and destroy old interactive jobs.
Error using parallel.Cluster/findJob (line 74)
Unknown type: concurrentconcurrent.
I followed some of the steps mentioned on this link to deal with this problem:
However, it appears that these solutions are not permanent.
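One possible cause, offered as an assumption rather than a confirmed diagnosis: several MATLAB sessions that use the 'local' profile at the same time share one job storage folder, and concurrent pool start-ups can then interfere with each other. A minimal sketch of a commonly used workaround gives each Slurm job its own folder before the pool starts. The variable `storageDir` and the folder prefix `matlab_pool_` are illustrative names; `SLURM_JOB_ID` is the environment variable Slurm sets inside a job.

```matlab
% Sketch: use a private job storage folder per Slurm job.
% Assumption: the error is caused by concurrent jobs sharing the default folder.
jobId = getenv('SLURM_JOB_ID');              % empty when not running under Slurm
if isempty(jobId)
    storageDir = tempname;                   % unique folder name for non-Slurm runs
else
    storageDir = fullfile(tempdir, ['matlab_pool_' jobId]);  % illustrative prefix
end
if ~exist(storageDir, 'dir')
    mkdir(storageDir);                       % folder must exist before it is assigned
end

pc = parcluster('local');
pc.JobStorageLocation = storageDir;          % applies to this cluster object only
nWorkers = min(24, pc.NumWorkers);           % never ask for more than the profile allows
pool = parpool(pc, nWorkers);
disp(['Pool started with ' num2str(pool.NumWorkers) ' workers']);
```

The setting is made on the cluster object `pc` and is not saved back to the profile, so each job script has to repeat it. Folders created under `tempdir` are not removed automatically.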


Answers (1)

Mohammad Sami on 18 Sep 2020
Edited: Mohammad Sami on 18 Sep 2020
Perhaps you can try
core = feature('numcores');
pool = parpool('local',core);
disp(['Pool has been started with Num Workers ' num2str(pool.NumWorkers)]);
Additionally, you can try restarting the pool if it has fewer workers than the number of cores.
retries = 0;
retry_limit = 3;
while (pool.NumWorkers < core)
    retries = retries + 1;
    disp('Restarting parallel pool');
    delete(pool);
    pool = parpool('local',core);
    disp(['Pool has been started with Num Workers ' num2str(pool.NumWorkers)]);
    if(retries >= retry_limit)
        break;
    end
end
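As a variation on the snippets above, the worker count can be taken from the Slurm allocation instead of from the node's core count, and the job can stop early when the pool is smaller than expected. This is only a sketch. It assumes the job script requests CPUs with `--cpus-per-task`, because Slurm exports `SLURM_CPUS_PER_TASK` only in that case. The variable names `nStr` and `nRequested` are illustrative.

```matlab
% Sketch: size the pool from the Slurm allocation when it is available.
nStr = getenv('SLURM_CPUS_PER_TASK');        % empty if --cpus-per-task was not given
if isempty(nStr)
    nRequested = feature('numcores');        % same fallback as in the answer above
else
    nRequested = str2double(nStr);
end

pc = parcluster('local');
nRequested = min(nRequested, pc.NumWorkers); % the profile caps the pool size
pool = parpool(pc, nRequested);

% Stop the job early instead of silently running on fewer workers.
assert(pool.NumWorkers == nRequested, ...
    'Expected %d workers but the pool has %d.', nRequested, pool.NumWorkers);
```

Failing fast this way makes an undersized pool visible in the job's error file, rather than leaving it to be inferred from a long run time.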
  1 comment
Geeta Garg on 18 Sep 2020
Hi Mohammad,
Thanks for your reply.
I'll try your suggestion to see if it works.
Geeta
