Error when using parallel computing toolbox

Florian on 12 Jul 2023
Commented: Farhad on 6 Oct 2023
Hi,
I am running MATLAB on a Linux cluster, using the Parallel Computing Toolbox. Everything worked until now, but suddenly, when I run my code, I get the following error:
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'Processes' in the Cluster Profile Manager.
Error in samplescript_expdata (line 278)
parpool(8)
Caused by:
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
So I went to the Cluster Profile Manager and ran the validation process. However, the last step fails as well, reporting the lengthy error message that I paste below.
Does anyone have any idea what's wrong and how I can fix this?
Thank you in advance for your help!
Stage started at 2:20:01 PM. Completed in 0 min 30 sec.
Error Report: An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Debug Log: CLIENT LOG OUTPUT
Session starting on cluster type: Local, with name: Processes
Session failed to start when creating InteractiveClient. Error: Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Error in parallel.internal.pool.CppBackedSession.buildInteractiveClient (line 397)
clientSession = parallel.internal.pool.SpfClientSession(bindEndpoint, ...
Error in parallel.internal.pool.AbstractClusterPool>@(c)parallel.internal.pool.CppBackedSession.buildInteractiveClient(c,sessionInfo) (line 839)
sessionBuildFcn = @(c) parallel.internal.pool.CppBackedSession.buildInteractiveClient(c, sessionInfo);
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 100)
[session, connFcn] = sessionBuildFcn(clus);
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 874)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 636)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 510)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, ...
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 391)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 302)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 74)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);
Session failed to start with message: Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Error in parallel.internal.pool.CppBackedSession.buildInteractiveClient (line 397)
clientSession = parallel.internal.pool.SpfClientSession(bindEndpoint, ...
Error in parallel.internal.pool.AbstractClusterPool>@(c)parallel.internal.pool.CppBackedSession.buildInteractiveClient(c,sessionInfo) (line 839)
sessionBuildFcn = @(c) parallel.internal.pool.CppBackedSession.buildInteractiveClient(c, sessionInfo);
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 100)
[session, connFcn] = sessionBuildFcn(clus);
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 874)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 636)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 510)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, ...
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 391)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 302)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 74)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);.
Failed to run the DisarmableOncleanup callback due to the following error:
Unrecognized method, property, or field 'pStopLabsAndDisconnect' for class 'parallel.internal.pool.InteractivePoolClient'.
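For anyone reproducing this outside the Cluster Profile Manager, the following is a minimal sketch (not an official diagnostic) that starts the pool inside a try/catch and keeps the full error report as text, so it can be attached to a support request. The profile name 'Processes' and the pool size 8 are taken from the post above; the variable names are only illustrative.

```matlab
% Sketch: capture the full error report when the pool fails to start.
% 'Processes' and the pool size 8 come from the post above.
profileName = 'Processes';
numWorkers  = 8;

delete(gcp('nocreate'));        % close a pool that is still open, if any
c = parcluster(profileName);    % cluster object for the chosen profile

started = false;
report  = '';
try
    pool = parpool(c, numWorkers);
    started = true;
    fprintf('Pool started with %d workers.\n', pool.NumWorkers);
catch err
    % The extended report contains the error message and the stack trace.
    report = getReport(err, 'extended', 'hyperlinks', 'off');
    fprintf('%s\n', report);
end
```

If the pool fails, `report` holds the same kind of text as the log above and `started` stays false.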
1 comment
Edric Ellis on 13 Jul 2023
I suggest you contact MathWorks support for help with this.


Answers (1)

Debadipto on 1 Aug 2023
Hi Florian,
Please refer to the following article:
If this doesn't solve the issue, then please reach out to MathWorks support for help.
Regards,
Debadipto Biswas
2 comments
Farhad on 6 Oct 2023
Hello,
I am also running Parallel Server on a cluster with SLURM as the scheduler.
I created a generic profile, as there is no shared storage between the users (clients) and the worker nodes. In the validation process everything runs fine except the last step, where I get the same error message as posted above.
Unfortunately, I can't access the link you posted.
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Farhad on 6 Oct 2023
Update:
I also use https://github.com/mathworks/matlab-proxy on the cluster.
When I first start a session through matlab-proxy and then use the Slurm profile, the pool starts successfully.
The output shows:
Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult
Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult
Client starting to connect to workers
Connected to parallel pool with 2 workers.
But when I try the same from my Windows MATLAB client, it doesn't work.
I get almost the same output:
Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult
Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult
But then, after a while (the timeout duration), the connection fails:
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
I am using a generic profile in which I set the AdditionalProperties field ClusterHost to the publicly available domain name of the cluster's login/head node, but the workers themselves are not reachable from outside.
So I guess the binding/connecting fails because there is a private cluster network behind the login node, and the clientEndpoint is not proxied through to the MATLAB client machine (a Windows desktop).
Is there any known issue with this? Or am I missing some configuration in the generic profile?
Thanks in advance
Best Regards
Farhad
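One way to test the guess above is to check whether the worker endpoints from the log can be reached from the client machine at all. The sketch below pulls host and port out of the two "Got clientEndpoint" lines and tries a plain TCP connection to each. It assumes the URL layout `tcp://tcpnodelay=<host>:<port>/...` seen in the output, and a MATLAB release whose `tcpclient` accepts the 'ConnectTimeout' option; the 5-second timeout and the variable names are arbitrary choices. The ports belong to one specific launch, so the check is only meaningful if it runs while the workers are still waiting for the client.

```matlab
% Endpoint lines copied from the output above.
logLines = { ...
    'Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult', ...
    'Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult'};

% Assumed URL layout: tcp://tcpnodelay=<host>:<port>/...
tokens = regexp(logLines, 'tcp://tcpnodelay=([^:/]+):(\d+)', 'tokens', 'once');
hosts  = cellfun(@(t) t{1}, tokens, 'UniformOutput', false);
ports  = cellfun(@(t) str2double(t{2}), tokens);

timeoutSeconds = 5;                 % arbitrary choice for this sketch
reachable = false(size(ports));
for k = 1:numel(ports)
    try
        % A successful constructor call means the TCP connection was opened.
        t = tcpclient(hosts{k}, ports(k), 'ConnectTimeout', timeoutSeconds);
        reachable(k) = true;
        clear t                     % clearing the object closes the connection
    catch
        reachable(k) = false;       % name not resolvable, refused, or timed out
    end
    fprintf('%s:%d reachable from this machine: %d\n', ...
        hosts{k}, ports(k), double(reachable(k)));
end
```

If the check reports 0 on the Windows client but 1 from a session on the cluster, that would be consistent with the workers sitting on a private network that the client cannot route to. It does not by itself say which profile setting, if any, would fix it.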

Release

R2023a