Why do workers keep aborting during parallel computation on a cluster?
Muh Alam
on 7 Dec 2020
Commented: Kojiro Saito
on 8 Feb 2021
I keep getting the warning
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)]
when I run a simulation that has a parfor loop on the cluster. I noticed that workers abort execution one after another, and that this seems to happen more often on the cluster than on my PC.
I would like to know the reason for this issue. Is there a way to avoid it?
Thanks.
19 comments
Kojiro Saito
on 7 Feb 2021
A heterogeneous cluster could be the cause. This link describes the system requirements of Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
Accepted Answer
Kojiro Saito
on 8 Feb 2021
A heterogeneous environment could be the cause of this issue. This link describes the system requirements of Parallel Server, not Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
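One rough way to check whether a set of nodes is homogeneous is to collect a platform string from each node and confirm they all match. The sketch below assumes a Slurm cluster and uses `uname -sm` (kernel name and machine architecture) as a stand-in for "same OS and word size"; the helper name `all_same_platform` and the node names are hypothetical, and matching `uname` output is only a first sanity check, not a guarantee.

```shell
# Hypothetical helper: reads one platform string per line on stdin
# (for example, `uname -sm` output gathered from each node) and
# succeeds only if every line is identical.
all_same_platform() {
    # sort -u collapses identical lines; exactly one distinct line
    # means every node reported the same platform string.
    distinct=$(sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Example usage on a Slurm cluster (node names are assumed, not run here):
#   srun --nodes=5 --ntasks-per-node=1 --nodelist='node[0-4]' uname -sm \
#       | all_same_platform && echo "nodes match" || echo "nodes differ"
```

If the check reports a mismatch, the differing nodes are candidates to exclude from the job.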
2 comments
Kojiro Saito
on 8 Feb 2021
If you know which nodes are homogeneous, you can restrict the job to those nodes with sbatch. For example, if node0 through node4 run the same OS, you can use the --nodelist option (or -w):
sbatch --nodelist node[0-4] yourscript.sh
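The same restriction can also be written into the job script as an `#SBATCH` directive, so it does not have to be repeated on the command line. The following is a minimal sketch: the helper `write_job_script` is hypothetical, the node list is the example from above, and the `matlab -batch` line assumes MATLAB R2019a or later with the function name taken from the question's stack trace. Adapt it to however the job is actually launched.

```shell
# Hypothetical helper: writes a Slurm job script pinned to a node list.
# $1 = output file, $2 = node list in Slurm hostlist syntax (e.g. node[0-4])
write_job_script() {
    cat > "$1" <<EOF
#!/bin/bash
#SBATCH --nodelist=$2
# Assumption: MATLAB is on the PATH; fuction_pa1 is the function from the question.
matlab -batch "fuction_pa1"
EOF
}

# Example usage (quote the node list so the shell does not glob the brackets):
#   write_job_script yourscript.sh 'node[0-4]'
#   sbatch yourscript.sh
```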