How can I work around a race condition on a Parallel Computing job storage location?

When submitting multiple parallel computing jobs simultaneously from different MATLAB instances, a race condition can occur at the job storage location. This can especially be an issue on MATLAB Production Server workers. This manifests as a crash or hang when a second pool is opened at the same time as the first pool, and produces a variety of strange errors like the following:
Error using parallel.Job/preSubmit (line 592)
Unable to read MAT-file /nfs/pathname/username/.matlab/local_cluster_jobs/R2019a/Job1.in.mat.
File might be corrupt.
Error using parpool (line 113)
Failed to convert value stored in Settings for property JobStorageLocation to a
datalocation.
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 678)
Failed to start pool.
Error using parallel.Job/createTask (line 320)
Only one task may be created on a concurrent Job.
Error using parpool (line 113)
Invalid default value for property 'ParallelNode' in class 'parallel.internal.settings.ParallelSettingsTree':
No value is set for setting 'PCTVersionNumber' at any level.
How can I work around this issue?

Accepted Answer

MathWorks Support Team on 21 Jul 2020
Edited: MathWorks Support Team on 11 Jun 2020
When multiple MATLAB clients on the same file system submit parallel computing jobs simultaneously, each one copies files to the default job storage location to coordinate the creation of a parpool. If two or more jobs do this at the same time, one job can start altering the files of another, which creates a race condition and introduces errors.
Attached to this answer is a piece of code that changes how the default job storage location is named. If run before ANY Parallel Computing code in a MATLAB session, it appends the current process ID to the default JobStorageLocation of ALL known profiles in that MATLAB session. The change lasts as long as the MATLAB process is running and does not persist beyond that. This stops multiple MATLAB clients from using the same default job storage location and thus works around the race condition.
NOTE that a new JobStorageLocation is created for each MATLAB process, so the user will need to clear out old ones periodically.
The preferred solution is to simply point MATLAB workers to different job storage locations, or avoid submitting multiple jobs simultaneously to the same storage location. However, the above is an alternative when either of these is not possible.
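The preferred approach of giving each MATLAB session its own job storage location might look roughly like the sketch below. This is not the code attached to this answer. It assumes the 'local' cluster profile (newer releases name it 'Processes'), relies on the undocumented feature('getpid') call to obtain the process ID, and uses a hypothetical folder name under tempdir. It changes only the cluster object in the current session, not the saved profile.

```matlab
% Sketch: give this MATLAB session its own job storage location.
% Assumptions (not from the original answer): the 'local' profile,
% the folder name 'pct_jobs_<pid>', and a location under tempdir.

pid = feature('getpid');   % undocumented call; returns this process's ID
jsl = fullfile(tempdir, sprintf('pct_jobs_%d', pid));

if ~exist(jsl, 'dir')
    mkdir(jsl);            % create the folder if it is missing
end

c = parcluster('local');   % cluster object, modified for this session only
c.JobStorageLocation = jsl;

pool = parpool(c);         % pass the cluster object so the pool uses jsl

% ... parfor / parfeval work goes here ...

delete(pool);              % shut the pool down
rmdir(jsl, 's');           % remove this session's job files
```

Removing the folder at the end of the session avoids the periodic cleanup mentioned above, provided the session exits normally. If depending on an undocumented call is a concern, tempname also returns a unique path under tempdir and could replace the PID-based name.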
