I'm starting to think this has to do with what parfor does, which is simply calling omp. Isn't something more than omp needed for multiple nodes, like actual MPI calls? In that case, it seems MATLAB is incapable of running on more than 1 node, even with Parallel Server. Or does PS somehow extend parfor (and thus omp)?
Cluster uses only one node, even though 5 nodes Running (parfor)
Science Machine
on 4 Oct 2022
Commented: Raymond Norris
on 17 Oct 2022
I am running an .m file containing parfor loops on our cluster.
Using a cluster config file (below), I request 5 nodes with 48 procs each, and I open the pool with parpool(5*48).
I am able to log on to each node individually and monitor its use with htop.
I see that MATLAB is indeed running on each node. But judging by RAM usage, only 1 node seems to be doing any work. If I increase the size of the parfor loop, the process crashes.
I was looking at other posts that mention setting a parameter, LASTN = maxNumCompThreads('automatic'). This is equal to 30 before setting anything. I have tried setting it to the total number of workers, e.g. numProcs*numNodes = 48*5.
1. Here is the node that's being used: 111-150 GB in use. When the parfor loop is too big, that node's RAM maxes out at 187 GB and MATLAB crashes. Also, perhaps noteworthy: this is not the head node.
2. Here is an example of an unused node. Note that 34 GB is the total RAM usage when MATLAB is idle.
This is the config file I run before I run my job. To confirm the settings before the run, I can query e.g. c.AdditionalProperties.NumNodes and ... .numWorkers and get the expected amount (numnodes=numworkers=5*48).
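clear; clc
% List the cluster profiles known to this MATLAB installation
allNames = parallel.clusterProfiles()
rehash toolbox
% Run the cluster setup script to create the profile
configCluster
%/usr/local/MATLAB/R2019
c = parcluster;
% Scheduler settings: wall time, queue, and account
c.AdditionalProperties.WallTime = '24:20:0';
c.AdditionalProperties.QueueName = 'CAC48M192_L';
c.AdditionalProperties.AccountName = 'redacted';
nn = 5;   % number of nodes
pp = 48;  % processes per node
c.AdditionalProperties.NumNodes = nn;
c.AdditionalProperties.ProcsPerNode = pp;
c.AdditionalProperties.NumWorkers = pp*nn;
% Store the settings in the profile so later sessions reuse them
c.saveProfile
%end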
Accepted Answer
Science Machine
on 17 Oct 2022
2 comments
Raymond Norris
on 17 Oct 2022
In this case, totalProcCount tells parfor the maximum number of workers to allocate to the loop. You might do this when 100 workers are running but you only want 50 to be used. If you don't specify this, all 100 workers may be used.
Take the following example of running a local pool, which clearly can't run across >1 node.
local = parcluster("local");
pool = local.parpool(4);
tic
parfor (idx = 1:16,2)
pause(2)
end
toc
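Run this and you'll see it takes ~16.5s (instead of the 8+s it would take with all 4 workers), because the 16 two-second iterations are shared between only 2 workers. You can verify that all the workers are on the same node (which of course they have to be):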
spmd, [~,hn] = system('hostname'), end
This is just to explain that totalProcs doesn't control which nodes the workers run on.
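The hostname check above can be extended into a per-node tally, which is one way to see whether the workers of a pool really are spread across several nodes. This is only a sketch: it assumes a pool is already open and that the `hostname` command is available on the workers. The variable names (`hosts`, `names`, `counts`) are illustrative and not from the original posts.

```matlab
% Sketch: count how many pool workers run on each host.
% Assumes a parallel pool is already open.
pool = gcp('nocreate');
assert(~isempty(pool), 'Open a parallel pool first.');

% Each worker reports the name of the machine it runs on.
spmd
    [~, hn] = system('hostname');
    hn = strtrim(hn);            % drop the trailing newline
end

% hn is a Composite on the client: one value per worker.
hosts = cell(1, pool.NumWorkers);
for k = 1:pool.NumWorkers
    hosts{k} = hn{k};
end

% Tally workers per distinct hostname.
[names, ~, idx] = unique(hosts);
counts = accumarray(idx(:), 1);
for k = 1:numel(names)
    fprintf('%s: %d worker(s)\n', names{k}, counts(k));
end
```

With a local pool this should list a single host. On a multi-node pool, seeing only one host would suggest the scheduler placed every worker on the same node.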
Science Machine
on 17 Oct 2022
Edited: Science Machine
on 17 Oct 2022
More Answers (1)
Raymond Norris
on 6 Oct 2022
Starting a parpool will create very little activity for the workers. Only once you run a parallel construct (e.g., parfor) will any work be given to the workers.
Increasing the parfor loop shouldn't crash the workers. What work are you doing? Did you request enough memory (c.AdditionalProperties.MemUsage)? Or do you mean increasing the parallel pool?
I can't tell where you're running MATLAB (your machine, head node, compute node?), but I suspect that if maxNumCompThreads is 30 and there are 48 cores per node, you're not running MATLAB on the compute node. You don't want the thread count to be greater than the number of cores on a node (since threads won't spawn across nodes). To set the thread count, call
c.NumThreads = 48;
This will then force each worker to run on its own node.
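As a rough illustration of the setting above, the sketch below sets the thread count on the cluster object, opens a pool, and asks each worker for its maxNumCompThreads value. It assumes a MATLAB release whose cluster object exposes NumThreads and a profile that allows one 48-thread worker per node. The pool size of 5 (one worker per node) is an assumption for illustration, not something stated in the thread.

```matlab
% Sketch: one multithreaded worker per node, under the assumptions above.
c = parcluster;              % default profile, as in the question
c.NumThreads = 48;           % threads per worker (value from the answer)

nWorkers = 5;                % assumption: one worker on each of the 5 nodes
pool = c.parpool(nWorkers);

% Ask every worker how many computational threads it may use.
spmd
    nThreads = maxNumCompThreads;
end
for k = 1:pool.NumWorkers
    fprintf('worker %d: %d computational threads\n', k, nThreads{k});
end

delete(pool);                % release the cluster resources
```

The design trade-off is fewer, multithreaded workers versus many single-threaded ones: parfor distributes iterations over workers, while the thread count only affects multithreaded computations inside each worker.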
Workers are capable of running on more than one node. Post a bit more context and we can figure out what's going on.
- Are you running MATLAB on your machine, head node, or compute node?
- Are you submitting jobs with parpool or batch?
- What size pool are you starting (i.e., parpool(X) or batch(.., 'Pool', X))?
- Can you provide the code you're trying to run?
5 comments
Raymond Norris
on 17 Oct 2022
Slurm is telling you that your combination of cores, threads, and nodes is not available. By the looks of it, I'd say someone on our team helped implement what you're doing (i.e., using configCluster). If you're interested, contact Technical Support (support@mathworks.com) with your contact info. They in turn can get hold of me. We can work this out offline.