Why are parfor loops so inefficient?

In terms of memory, that is. A parfor loop uses vastly more memory than its for-loop counterpart, apparently because it makes copies of all of the data for each thread. But it does this even when the data are read-only, in which case the copies are completely unnecessary: simultaneous reads of a piece of data from multiple threads are, in general, just fine. Moreover, Matlab clearly already knows which data are read-only, through its 'classification'. Yet the copies are made anyway. I have lost a lot of time as my system grinds to a halt when trying to run parallelized code on large data files. Is there any way to remedy the situation? Or is it just a programming fail we have to live with (at least for now)?

Accepted Answer

Matt J
Matt J on 20 Dec 2013
Edited: Matt J on 20 Dec 2013


It isn't quite the case that copies of all the data are always made. Sliced variables are not copied, nor are distributed variables (used in SPMD). I think you're expected to be partitioning up your computation to take advantage of that.
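As a minimal sketch of that classification (assuming a parallel pool is open; the variable names and sizes are purely illustrative), parfor treats the two kinds of variable quite differently:

```matlab
A = rand(1000, 8);        % sliced: indexed by the loop variable
B = rand(2000, 1000);     % broadcast: read-only inside the loop
out = zeros(1, 8);
parfor k = 1:8
    col = A(:, k);            % sliced: only column k is sent to the worker
    out(k) = norm(B * col);   % broadcast: all of B is copied to every worker
end
```

Here the per-worker cost of A stays small no matter how many columns it has, while B is transferred whole to each worker.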
>>simultaneous reads from multiple threads are just fine.
It might help if you elaborate on that. My understanding was that multiple threads trying to read from the same memory location is a major problem in parallel computing, because some threads then have to wait idly for their turn at access.

7 comments

Matthew Phillips
Matthew Phillips on 20 Dec 2013
>>I think you're expected to be partitioning up your computation to take advantage of that.
Please. Matlab's main value proposition (for many users at least) is ease of development. I already work things up to the point that my code doesn't get MLint warnings, so I'm not missing obvious slices. Preprocessing further just takes even more time and adds code complexity. 'Broadcast' variables are already read-only by definition--"never assigned inside the loop" according to the page. So Matlab should be able to handle them efficiently.
As for simultaneous reads, threads waiting their turn to read is not a problem. Crashes and UB are a problem, but simultaneous reads do not cause these. And threads should not, in general, have to wait. The objects in memory we're talking about here are huge--so different threads will be operating on different parts of them at a given time and there will be no conflict. You would get waiting at most when two threads happen to be dealing with the same small piece (which in my case doesn't happen). Even if there was occasional blocking, that's insignificant in terms of its effect on productivity in comparison with all my memory and swap getting eaten up.
Matt J
Matt J on 21 Dec 2013
>>Matlab's main value proposition (for many users at least) is ease of development.
Well, ease of development has only ever been an attribute of MATLAB for those willing to compromise on speed and memory consumption. You would always get higher-performing code by implementing in C/C++. Your case seems to be a simple continuation of that rule ;)
Anyway, I think the answer to your original question is yes. You will have to live with it unless you want to go deeper into your code and figure out how to better optimize it to work with parfor. That's not to say things should be that way, but I'm pretty sure they are that way.
Matthew Phillips
Matthew Phillips on 21 Dec 2013
Edited: Matthew Phillips on 21 Dec 2013
Yes, Matlab vs. C/C++ is precisely the design decision I face every day. I have experience with extremely high-performing, multi-threaded code in C/C++, but Matlab's higher productivity forces me to use it in some research applications. Higher productivity until now, that is.
But just to be clear, "better optimize" means eliminating all broadcast variables (of significant size)? That is what this seems to come down to, right? (Which, for the record, is not feasible with large structs.)
Matt J
Matt J on 21 Dec 2013
Edited: Matt J on 21 Dec 2013
>>But just to be clear, "better optimize" means eliminate all broadcast variables (of significant size)? That is what this seems to come down to right?
Yes, eliminate, or figure out how to slice them. You seem to be saying that different parallel workers never read from the same locations, so in theory this should be possible.
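One hedged sketch of such slicing (illustrative sizes; it assumes the columns divide evenly into the number of iterations): pre-partition the would-be broadcast matrix into a cell array with one cell per iteration, so parfor classifies it as sliced:

```matlab
B = rand(1000, 8000);                         % would otherwise be broadcast
n = 8;
C = mat2cell(B, 1000, repmat(8000/n, 1, n));  % one 1000x1000 block per cell
clear B                                       % avoid holding both copies
out = zeros(1, n);
parfor k = 1:n
    out(k) = sum(C{k}(:));    % C is sliced: only C{k} travels to the worker
end
```

This only helps, of course, when each iteration genuinely touches a disjoint block of the data.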
There are also ways to build large data on the labs without broadcasting them, see e.g.,
and moreover to have the data persist there. It wasn't clear to me whether the broadcasting time is your actual problem or whether you're hitting memory-consumption limits from the data cloning.
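For reference, one pattern available at the time is spmd with a Composite: each lab builds or loads its share of the data once, and it persists on the labs across later spmd blocks instead of being re-broadcast. (The file name 'bigdata.mat' and its variable x below are hypothetical.)

```matlab
spmd
    S = load('bigdata.mat');    % each lab loads once; nothing is broadcast
end
% S is now a Composite whose contents stay resident on the labs:
spmd
    partial = sum(S.x(:));      % assumes bigdata.mat holds a matrix x
end
total = sum([partial{:}]);      % gather the per-lab partial sums
```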
Walter Roberson
Walter Roberson on 21 Dec 2013
You get waiting when threads are dealing with the same cache line, which is more of a problem than individual addresses. If you slice your arrays the wrong way (e.g., across a row) then you can get cache conflicts that slow operations substantially even though the same location is not being referenced.
High performance operation can depend upon preventing cache line conflicts (and especially cache line thrashing). One software technique that can be quite useful is to force arrays to be deliberately unaligned so that even if the arrays are the same size, A(I,J) is not on the same cache line as B(I,J). Unfortunately MATLAB does not offer any tools to force alignment, or to propagate alignment (e.g., on copy-on-write).
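The padding idea can be sketched as follows, with the caveat that MATLAB's copy-on-index behavior largely defeats it (sizes hypothetical):

```matlab
pad = 8;                       % a few extra doubles to shift the offset
A = rand(2^20, 1);
Braw = rand(2^20 + pad, 1);    % same payload, deliberately oversized
% The intent is that Braw's data block no longer mirrors A's offsets,
% but any indexed use such as
s = A + Braw(pad+1:end);
% materializes a fresh copy of Braw(pad+1:end), whose alignment is again
% out of the user's hands -- which is exactly the missing-tools problem.
```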
On some hardware platforms, any number of processors can simultaneously read from memory provided that no processor is trying to write to the same cache line. I have not investigated how cache coherency is handled on x64 architectures.
Matt J
Matt J on 21 Dec 2013
>>Unfortunately MATLAB does not offer any tools to force alignment, or to propagate alignment
What if you assign A(I,J) and B(I,J) to cell arrays?
C{1} = A(I,J);
C{2} = B(I,J);
Walter Roberson
Walter Roberson on 21 Dec 2013
No joy, Matt J: each element of a cell array has its own header block that includes a pointer to the data block, and users have no control over the alignment of the data block. If the data blocks are small enough they are going to come out of the "small store" that the memory manager uses instead of coming out of complete system blocks allocated as-needed for larger memory. And if the blocks are large enough to come out of the complete system blocks then they are going to get allocated at the same relative offset, which is often the worst thing for traversing arrays.
After that, when operations are done on the arrays, if the operation is one of a number of common patterns and the arrays are large enough, they are going to be copied for use by BLAS, and there is no control over how they get allocated in that copying.
