Strange core usage when running Slurm jobs
8 views (last 30 days)
Fredrik P
on 16 Mar 2024
Commented: Damian Pietrus on 21 Mar 2024
I'm trying to run jobs on an HPC cluster using Slurm, but I run into problems both when I run interactive jobs and when I submit batch jobs.
- When I run an interactive job and book a single node, I can use all of the node's 20 cores. But when I book more than one node for an interactive job, the cores on the extra nodes are simply left unused.
- When I run a batch job, the job uses only one core per node.
Do you have any idea what I might be doing wrong?
1. I book my interactive job from the command prompt using the following commands:
interactive -A myAccountName -p devel -n 40 -t 0:30:00
module load matlab/R2023a
matlab
to submit a 30-minute 40-core job to the "devel" partition using my account (not actually called "myAccountName"), load the Matlab module, and launch Matlab as an X application. Once in Matlab, I first choose the "Processes" parallel profile and then run the "Setup" and "Interactive" sections of the silly little script at the bottom of this question. In two separate terminal sessions, I then use
ssh MYNODEID
htop
where MYNODEID is either of the two nodes assigned to the interactive job. There I see that the job uses all of the cores on one node and none of the cores on the second node.
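For reference, the pool can also be inspected from within Matlab to see where its workers actually end up (a small sketch, assuming a pool opened from the "Processes" profile is already running):
p = gcp('nocreate');                    % handle to the current pool, if any
disp(p.NumWorkers)                      % number of workers in the pool
spmd
    [~, host] = system('hostname');     % name of the node each worker runs on
    fprintf('Worker %d runs on %s', spmdIndex, host);
end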
2. To book my batch job, I load and launch Matlab from the command prompt using the following commands
module load matlab/R2023a
matlab
and then run the "Setup" and "Batch" sections of the silly little script at the bottom of this question. Using the same procedure as above, htop shows that the job uses two cores (one on each node) and leaves the remaining 38 cores (19 on each node) unused.
Silly little script
%% Setup
clear;
close all;
clc;
N = 1000; % Length of fmincon vector
%% Interactive
x = solveMe(randn(1, N));
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
Cluster.batch( ...
@solveMe, ...
0, ...
{}, ...
'pool', 39 ...
); % Submit a 30-minute 40-core job to the "devel" partition using my account (not actually called "myAccountName")
%% Helper functions
function A = slowDown()
% Burn some CPU time so that the parallel constraint evaluations are visible in htop
A = randn(5e3);
A = A + randn(5e3);
end
function x = solveMe(x0)
opts = optimoptions( ...
"fmincon", ...
"MaxFunctionEvaluations", 1e6, ...
"UseParallel", true ...
);
x = fmincon( ...
@(x) 0, ...
x0, ...
[], [], ...
[], [], ...
[], [], ...
@(x) nonlinearConstraints(x), ...
opts ...
);
function [c, ceq] = nonlinearConstraints(x)
c = [];
A = slowDown();
ceq = 1 ./ (1:numel(x)) - cumsum(x);
end
end
0 comments
Accepted Answer
Damian Pietrus
on 19 Mar 2024
Based on your code, it looks like you have correctly configured a cluster profile to submit a job to MATLAB Parallel Server. In this case, your MATLAB client will always submit a secondary job to the scheduler. It is in this secondary job that you should request the bulk of your resources. As an example, on the cluster login node you should only ask for a few cores (enough to run your MATLAB serial code), as well as a longer WallTime:
% Two cores, 1 hour WallTime
interactive -A myAccountName -p devel -n 2 -t 1:00:00
module load matlab/R2023a
matlab
Next, you should continue to use the AdditionalProperties fields to shape your "inner" job:
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
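If it helps, you can also list every field that your integration scripts expose and save the values into the profile so that they persist between MATLAB sessions, for example (assuming the same 'rackham R2023a' profile as above):
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties            % display all available fields
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.saveProfile                     % keep these settings for future sessions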
When you call the MATLAB batch command, that is where you request the total number of cores that you would like your parallel code to run on:
myJob40 = Cluster.batch(@solveMe, 0, {},'pool', 39);
myJob100 = Cluster.batch(@solveMe, 0, {},'pool', 99);
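Once the inner job has finished, you can pull its results back into your client session, for example (assuming the myJob40 object from above; fetchOutputs only returns values if the batch call asked for at least one output):
wait(myJob40);                          % block until the inner job completes
diary(myJob40)                          % show Command Window output from the workers
% results = fetchOutputs(myJob40);      % only if the batch call requested outputs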
Notice that since the batch call submits a completely separate job to the scheduler queue, you can choose a pool size larger than what you requested in your 'interactive' CLI command. Also notice that the Cluster.AdditionalProperties WallTime value is shorter than the 'interactive' value; this is to account for the time that the inner job may wait in the queue.
Long story short -- when you call batch or parpool within a MATLAB session that has a Parallel Server cluster profile set up, it will submit a secondary job to the scheduler that can have its own separate resources. You can verify this by manually viewing the scheduler's job queue.
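For example, you can run squeue -u $USER in a separate terminal on the login node and watch the inner job appear as its own Slurm job, or query it from the MATLAB client (again assuming the myJob40 object from above):
myJob40.State                           % 'queued', 'running', or 'finished'
Cluster.Jobs                            % all jobs submitted through this profile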
Please let me know if you have any further questions!
4 comments
Damian Pietrus
on 21 Mar 2024
Thanks for including that -- it looks like your integration scripts are from around 2018. Since they are a bit out of date, they don't include some changes that will hopefully fix the core binding issue you're experiencing. I'll reach out to you directly, but for anyone else who finds this post in the future, you can get an updated set of integration scripts here: