How to close client node when running a parpool job across multiple nodes of a cluster using slurm scheduler integration
3 views (last 30 days)
James Heald
on 22 Mar 2020
Commented: Edric Ellis
on 24 Mar 2020
I have access to a cluster of nodes. Each node has 64 physical cores. My goal is to run a job in parallel across 2 nodes (128 cores). To do this I create a remote cluster profile that utilises direct slurm integration. I first submit a job script to slurm:
#!/bin/bash
#SBATCH -A account
#SBATCH -p partition
#SBATCH -t 12:00:00
#SBATCH -J job
module purge
module load rhel7/default-peta4 matlab
matlab -nosplash -nodisplay -nodesktop -r "EmitCue(${1},${2},'${3}'); quit" # > matoutfile
which calls a matlab script (EmitCue.m). This matlab script requests the resources (a pool of workers):
clusterKNL = parallel.cluster.Slurm;
clusterKNL.NumWorkers = 128;
clusterKNL.NumThreads = 1;
clusterKNL.JobStorageLocation = JobStorageLocation;
clusterKNL.SubmitArguments = '--time=12:00:00 -A account -p partition -N 2 --ntasks-per-node=64 --cpus-per-task=1 -J job';
parpool(clusterKNL,128)
and then goes on to perform the desired computations.
The call to parpool in EmitCue.m itself submits a job to Slurm, which results in the required resources being allocated (in my case 2 nodes with 64 cores each, giving 128 cores). Note, however, that there are now 2 jobs running simultaneously: the first job (job 1) that was used to run the MATLAB script EmitCue.m, and the second job (job 2) that was submitted inside EmitCue.m to request access to 2 nodes. Now here is my problem. I can only request entire nodes on the cluster; I can't request a subset of cores on a node. Hence, when I run the above code I end up with an entire node (I think this is called the 'client') of 64 cores being used just to request 2 more nodes and delegate the work across the 128 workers on those nodes (job 2). None of the cores on the client contribute to the worker pool. This is hugely wasteful, and since my computing credits are measured in node-hours, I want to avoid it.
Thanks in advance
0 comments
Accepted Answer
Edric Ellis
on 23 Mar 2020
You would be better off submitting only a single job to the cluster, and having that job be a batch job with the 'Pool' option specified.
This means you need to set up the cluster profile so that it works from your client, at which point you can do this:
clus = parcluster('MyRemoteSlurmCluster');
job = batch(clus, @performComputations, 1, {inputArgs}, 'Pool', 127);
Where performComputations is your function that uses parfor or whatever. Specifying 127 for the Pool parameter means that the cluster will schedule a total of 128 worker processes - one for the leading worker, and 127 for the parallel pool attached to that leading worker.
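For reference, here is a minimal sketch of what performComputations might look like when the work is a parfor loop. The function name comes from the answer above; the input structure and the someExpensiveStep helper are illustrative assumptions, not part of the original code:

```matlab
function total = performComputations(inputArgs)
% This function body runs on the leading worker; iterations of the
% parfor loop are distributed across the 127 pool workers that the
% 'Pool' option of batch requested.
total = 0;
parfor k = 1:numel(inputArgs)
    % someExpensiveStep is a hypothetical placeholder for the real work
    total = total + someExpensiveStep(inputArgs(k));
end
end
```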
9 comments
Edric Ellis
on 24 Mar 2020
Once your batch job is complete, you somehow need to get back to the job object it created. If your desktop machine can see the appropriate JobStorageLocation, then that would be sufficient. For example, you might need to do something a bit like this: https://www.mathworks.com/help/parallel-computing/examples/run-batch-job-and-access-files-from-workers.html#RunBatchJobAndAccessFilesFromWorkersExample-4 to find the job that you submitted. From there, you can do:
job = findJob(clus, 'ID', 3); % or whatever
outputs = fetchOutputs(job); % get the output arguments of the function
diary(job) % display the command-window output.
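Since batch submits asynchronously, it can help to wait for the job to finish before fetching its outputs. A short sketch, assuming the same profile name and job ID used in the answer:

```matlab
clus = parcluster('MyRemoteSlurmCluster'); % profile name from the answer above
job = findJob(clus, 'ID', 3);              % or whatever your job's ID is
wait(job);                                 % block until the batch job finishes
outputs = fetchOutputs(job);               % output arguments of performComputations
delete(job);                               % optionally clean up the job storage
```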
More Answers (0)