unable to pass "Parallel pool test" on remote Parallel server

Mike VanHorn on 30 Jul 2019
Commented: Mike VanHorn on 15 Jun 2023
I have set up MATLAB Parallel Server on our cluster. The MATLAB Job Scheduler is running on the headnode, and is able to talk to all of the workers on the compute nodes.
If I run MATLAB as a client on the headnode, I can pass all of the cluster profile validation tests. However, if I run the same tests on a different client machine (outside of the cluster), all of the tests pass except for the "Parallel pool test (parpool)". It fails after about 6 minutes with the following error:
-clip-
Error Report: Failed to initialize the interactive session.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 789)
The interactive communicating job errored with the following message: Client unable to connect to worker. Check whether a firewall is blocking communication between the worker machine and the MATLAB client machine.
-clip-
I have the headnode set up to NAT the cluster nodes' traffic out of the cluster, so I am not sure why this isn't working. What is different about this test that makes it fail when the others pass? It seems to me that in the previous tests the client talks only to the MJS, but in this case the workers need to talk directly to the client (according to the error message). That should work, since I can ssh from the worker machine to the client without issue. If the converse is true, and the client has to talk directly to the worker, I don't see how this would ever work in a cluster situation.
On another track, it may be that some ports are being blocked by filtering on our network switches. What ports do the workers need to be able to talk to the client?
Thank you for any help!

Answers (3)

Jason Ross on 30 Jul 2019
The required ports are documented here. Note that they are configurable in the mdce_def or mjs_def (.bat or .sh, depending on platform) files in <matlabroot>/toolbox/distcomp/. There is some more detail in that file as well.
It may be useful to set the hostname, IP, or ports explicitly on the client host. To do that, use the pctconfig command in a fresh session of MATLAB before you attempt to run any other parallel commands. The client tries to "get this right", but in some cases you need to be explicit about the exact IP of the host and/or the hostname to use.
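To narrow down whether a firewall is dropping traffic on a given port, one option is a small TCP probe run from the machine that originates the connection (for example, a worker node probing the client). The following is only a minimal sketch in Python, not a MathWorks tool; the function names `probe` and `probe_range` are made up for illustration, and the host and port values you pass in are your own. Note that a port only reports "open" while something is actually listening on it, so run the probe while the MATLAB client or worker is up.

```python
import socket


def probe(host, port, timeout=2.0):
    """Try one TCP connection and classify the outcome.

    "open"    - something accepted the connection
    "refused" - the host answered, but nothing is listening on that port
    "timeout" - no answer at all, which is typical of a filtering firewall
    "error"   - any other failure (for example, name resolution)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def probe_range(host, first_port, last_port, timeout=2.0):
    """Probe every port from first_port to last_port, inclusive."""
    return {
        port: probe(host, port, timeout)
        for port in range(first_port, last_port + 1)
    }
```

For instance, `probe_range("client.example.com", 27370, 27375)` (hypothetical hostname and ports) returns a dictionary mapping each port to one of the four outcomes. A run of "timeout" results suggests filtering somewhere on the path, whereas "refused" means the packets reached the host.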

Mike VanHorn on 31 Jul 2019
I have seen the "Troubleshoot Common Problems" page you referenced, but I had used the formula on this page to open far fewer ports on the server. However, based on your suggestion, I have opened BASEPORT through BASEPORT+2000, inclusive. I'm hoping that helps with another problem: nodes are crashing with errors about not being able to "communicate with the client", even when nothing is running.
Unfortunately, opening all of these extra ports did not fix the problem I posted about; the "Parallel pool test (parpool)" still fails from a client machine outside of the cluster.
I was poking around using lsof while running the validation tests on the headnode (where everything passes), and it appears to me that during the "Parallel pool test (parpool)" test, the client is making direct connections to the workers, and not going through the server. As this seems to be the case, there is no way this is ever going to work in a cluster situation, because the compute nodes have private IPs (192.168.*.*), and there's no way for the client to be able to originate a connection with them.
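The addressing point above can be checked mechanically. This is a minimal sketch using Python's standard `ipaddress` module; the helper name `is_private_address` is made up for illustration. An address it flags as private is not routable from outside the cluster network, so a client outside can only reach it if the headnode forwards specific ports to it.

```python
import ipaddress


def is_private_address(addr):
    """Return True if addr lies in a private (non-routable) range."""
    return ipaddress.ip_address(addr).is_private
```

For example, `is_private_address("192.168.1.20")` returns True, which matches the compute-node addressing described in the comment.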

Lucas on 15 Jun 2023
Is there an update on this topic? I am facing the same issue, and my guess is that it is the same problem Mike had.
  1 Comment
Mike VanHorn on 15 Jun 2023
It's been nearly four years, but I don't think I ever got it working from a computer outside the cluster, due to the way it seemed to be designed (as I explained in my comment). I believe I had to give the user an account on the cluster, and he had to run MATLAB there (over the network, tunneling X, iirc).
However, that was a long time ago, so they may have changed something by now.

Release

R2019a
