MATLAB R2021b constantly fails to invoke parpool

nassos on 29 Apr 2022
Commented: Walter Roberson on 8 Feb 2024
Hi,
I have a problem in MATLAB R2021b when invoking parpool.
My script uses parpool in several places (4, to be exact).
I invoke parpool using the following snippet:
test_p = gcp('nocreate');
if isempty(test_p)
    myPool = parpool('local',64);
end
The first three pools open without a problem, but the fourth call to parpool fails with the following error:
Error using parpool (line 146)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
Failed to initialize the interactive session.
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
This unstable behaviour has happened multiple times, and not only with this script.
Sometimes the parpool opens, and other times it crashes.
As a workaround, I always restart MATLAB and delete ~/.matlab/local_cluster_jobs, but this is only a temporary remedy. The issue persists.
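For reference, the manual cleanup described above can be approximated from within MATLAB. The following is a minimal sketch, not a confirmed fix for this error: openPoolWithRetry is a hypothetical helper name, the 5-second pause is an arbitrary assumption, and delete(c.Jobs) removes every job stored for the profile, including unrelated batch jobs.

```matlab
function pool = openPoolWithRetry(numWorkers, maxAttempts)
% openPoolWithRetry  Reuse the current pool, or try to start one with retries.
% Hypothetical helper for illustration; not part of Parallel Computing Toolbox.
    pool = gcp('nocreate');          % returns [] when no pool is running
    if ~isempty(pool)
        return
    end
    c = parcluster('local');         % the 'local' profile used in the question
    for attempt = 1:maxAttempts
        try
            pool = parpool(c, numWorkers);
            return
        catch err
            warning('openPoolWithRetry:attemptFailed', ...
                'Attempt %d of %d failed: %s', ...
                attempt, maxAttempts, err.message);
            % Clear leftover job records for this profile before retrying.
            % Note: this also deletes any other jobs stored under it.
            delete(c.Jobs);
            pause(5);                % arbitrary wait before the next attempt
        end
    end
    error('openPoolWithRetry:failed', ...
        'Could not start a pool after %d attempts.', maxAttempts);
end
```

In the script, the original snippet would then become a single call such as myPool = openPoolWithRetry(64, 3);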
Running Validate in the Cluster Profile Manager failed at the parpool stage and produced the following report:
Start Time: Fri Apr 29 01:19:58 EDT 2022
Finish Time: Fri Apr 29 01:20:17 EDT 2022
Running Duration: 0 min 19 sec
Description: Job ran with 64 workers.
Error Report:
Command Line Output:
Debug Log:
Stage: Pool job test (createCommunicatingJob)
Status: Passed
Start Time: Fri Apr 29 01:20:17 EDT 2022
Finish Time: Fri Apr 29 01:20:36 EDT 2022
Running Duration: 0 min 19 sec
Description: Job ran with 64 workers.
Error Report:
Command Line Output:
Debug Log:
Stage: Parallel pool test (parpool)
Status: Failed
Start Time: Fri Apr 29 01:20:36 EDT 2022
Finish Time: Fri Apr 29 01:24:10 EDT 2022
Running Duration: 3 min 34 sec
Description: Failed to initialize the interactive session.
Error Report: Failed to initialize the interactive session.
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
Command Line Output:
Debug Log: CLIENT LOG OUTPUT
Currently connected to: 1
Checking communicating job status.
Session failed to start when creating InteractiveClient. Error: Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
Failed to initialize the interactive session.
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 142)
iThrowWithCause( 'parallel:convenience:FailedToInitializeInteractiveSession', err );
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 831)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 585)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 456)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, 'NumWorkers', numWorkers);
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 336)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 247)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 68)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
Failed to run the DisarmableOncleanup callback due to the following error:
Dot indexing is not supported for variables of this type.
What exactly is the problem here?
I am running MATLAB on a CentOS 7 machine with two "Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz" processors (64 physical and 128 logical cores in total) and 1.5TB of RAM.
I would really appreciate your help here as this is severely impacting my work.
Thank you in advance for your help and time!
  7 comments
Lin on 29 Apr 2022
Hi, I have exactly the same problem with parpool in R2021b but on a CentOS 8 machine. No iptables/nftables is used. My script sometimes works but sometimes doesn't. It would be great if anyone could help to solve the problem. Thank you.
nassos on 29 Apr 2022
I think I now understand your reasoning. Thank you for your informative posts.
So is there a way to verify whether the CentOS firewall is causing problems?
And if it is, how can this be remedied?
Thank you in advance for your help!


Answers (1)

Yash on 17 Jan 2024
Hi,
On Windows with MATLAB R2021b, users with non-ASCII characters in their usernames (for example, extended ASCII characters) can run into problems with the local cluster. Specifically, starting parallel pools or running independent jobs with commands like parpool('local') fails with vague messages such as "Failed to initialize the interactive session". This issue is described in the External Bug Report here: https://www.mathworks.com/support/bugreports/details/2619526
The issue was fixed in R2021b Update 3 and in R2022a. The bug report also provides a workaround that you can try.
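To check whether a given installation falls in the range described above, something like the following could be run. It is only a sketch: matlabRelease requires R2020b or later, the environment variable names USERNAME and USER are assumptions about the platform, and the cutoff of 127 simply flags anything outside 7-bit ASCII.

```matlab
% Report the running release and update level.
r = matlabRelease;
fprintf('Release %s, Update %d\n', r.Release, r.Update);

% The fix is reported for R2021b Update 3 and for R2022a.
needsUpdate = (r.Release == "R2021b") && (r.Update < 3);

% Look for characters outside 7-bit ASCII in the user name.
if ispc
    userName = getenv('USERNAME');   % assumed variable name on Windows
else
    userName = getenv('USER');       % assumed variable name on Linux/macOS
end
hasNonAscii = any(double(userName) > 127);

fprintf('Needs update: %d, non-ASCII user name: %d\n', ...
    needsUpdate, hasNonAscii);
```

If both flags are false, the bug report above is less likely to be the explanation for the failure.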
Hope this helps!
  5 comments
Yash on 8 Feb 2024
Edited: Walter Roberson on 8 Feb 2024
The workaround uses the "-c" startup flag to override MATLAB's default license path with one that contains only ASCII characters. The bug report gives the steps for Windows only, but at the end of it there is this link: https://uk.mathworks.com/matlabcentral/answers/102520-how-do-i-change-the-license-search-location-for-matlab
That page has the steps for Windows, macOS and Linux for the same workaround.
Walter Roberson on 8 Feb 2024
The workaround provided in the bug report is very OS specific. It is mostly accidental that it happens to mention a link that can be used for Linux.


Release

R2021b
