Backslash with triangular matrices

I'm using an LU decomposition of a matrix A (square, sparse, complex, symmetric (not Hermitian)):
[M1,M2,P,Q,R] = lu(A);
I then apply the backslash operator to these resulting matrices to build a Schur complement:
T = Q*(M2\(M1\(P*(R\B))));
If B is a vector, this second line executes fast enough. But when B is a square matrix, which it is in my case, the operation is very slow, since MATLAB internally loops over the columns of B (my conclusion after some tic/toc timing). The second line also runs on a single thread under MATLAB 2013b on Win7 64-bit, which makes it even slower.
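For what it's worth, the per-column pattern described above can be sketched outside MATLAB. Below is a plain-Python illustration (a toy dense solver, not how mldivide is implemented): each column of B gets its own independent forward and backward substitution through the triangular factors, so the total work grows linearly with the number of columns and nothing but the factors is shared between columns.

```python
def forward_sub(L, b):
    """Solve L y = b for a dense lower-triangular L (O(n^2) per column)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y

def back_sub(U, y):
    """Solve U x = y for a dense upper-triangular U (O(n^2) per column)."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# Tiny example factors with A = L*U; each inner list below is one column of B.
L = [[2.0, 0.0], [1.0, 3.0]]
U = [[1.0, 4.0], [0.0, 2.0]]
B_cols = [[2.0, 4.0], [14.0, 6.0]]

# The column loop: every column gets its own forward + backward sweep,
# which is exactly what serializes the work when B is a matrix.
X_cols = [back_sub(U, forward_sub(L, b)) for b in B_cols]
```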
What is a good way of speeding this up? A parfor loop over the columns of B didn't work for me in MATLAB 2013b, for the reasons discussed in http://www.mathworks.com.au/matlabcentral/answers/59250-error-with-parfor-and-big-matrices (despite Edric Ellis saying the problem had been fixed in 2013a). Anything else I can try?
Thanks!

2 comments

Edric Ellis on 13 Dec 2013
Can you post reproduction steps that make this fail using PARFOR?
Herwig on 17 Dec 2013
Hi Edric,
below is the code for which PARFOR fails. A is a large finite element matrix; it is no smaller than 5 MB, so I couldn't attach it. Csf is the right-hand side, with size(Csf,2) > 2000.
[M1,M2,P,Q,R] = lu(A);
temp = P*(R\Csf);
temp2 = zeros(size(temp));
matlabpool 4
parfor iii = 1:size(temp,2)
    temp2(:,iii) = M2\(M1\temp(:,iii));
end
matlabpool close
temp = Q*temp2;
I may be able to provide a dropbox link to the matrices A and Csf if that is helpful for you.
Herwig


Accepted Answer

Matt J on 17 Dec 2013 (edited 17 Dec 2013)
Your operation looks equivalent to
T=A\B
I would expect that to be the fastest. Note that backslash already does its own LU decomposition internally, but with optimized multi-threading. You seem to be repeating that decomposition work. Moreover, you've turned one backslash operation into three, possibly triggering three internal LU decompositions instead of one.

8 comments

Herwig on 17 Dec 2013 (edited 17 Dec 2013)
Hi Matt, thank you for your answer. Several comments:
  1. T=A\B is always fastest, but it is also always the most memory-consuming! If you switch, for example, to the complete ilu(...,'crout'), you can fit much larger problems into memory.
  2. Often you want to solve a system more than once with different right-hand sides (which is my case). This alone warrants the use of an explicit LU decomposition, because the factorization only needs to be computed once.
  3. The backslash operator will not trigger an LU decomposition for triangular system matrices. See http://www.mathworks.com.au/help/matlab/math/systems-of-linear-equations.html, and then "Permutations of Triangular Matrices" for more information. Hence, LU decomposition is actually only performed once, that is, no decomposition work is repeated.
  4. The backslash operator does not use multithreading for forward/backward substitutions of triangular matrices.
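Points 2 and 3 above amount to an amortization argument, which can be illustrated in plain Python (hedged: a toy dense Doolittle factorization without pivoting, nothing like MATLAB's sparse lu): pay the O(n^3) factorization once, then each new right-hand side costs only two O(n^2) triangular sweeps.

```python
def lu_doolittle(A):
    """Toy dense LU (no pivoting): A = L*U with unit-diagonal L. O(n^3), done once."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def lu_solve(L, U, b):
    """Reuse the factors: two O(n^2) triangular sweeps, no refactorization."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 3.0], [6.0, 3.0]]
L, U = lu_doolittle(A)             # the expensive part, paid once
x1 = lu_solve(L, U, [10.0, 12.0])  # first right-hand side
x2 = lu_solve(L, U, [7.0, 9.0])    # second right-hand side: no new factorization
```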
Matt J on 17 Dec 2013 (edited 17 Dec 2013)
Hi Herwig,
The backslash operator does not use multithreading for forward/backward substitutions of triangular matrices.
You mean that even across the columns of B, mldivide is not multi-threaded? I wouldn't mind seeing a documentation source for that. If column-wise processing is already multi-threaded, then parallelizing it yourself using parfor or anything else has little hope of improving things further.
But if you are right, then I wonder if the optimal thing for you would be to co-distribute the columns of B and solve chunks of them separately on different labs:
B=distributed(B);
spmd
T = M2\(M1\B);
end
T=gather(T);
Further, if you are applying this to multiple matrices B, then presumably there is a reason you are not simply concatenating more B matrices together so as to co-process more columns. Since you are not doing this, I'm assuming you are memory-limited on your client. But co-distribution could allow you to overcome that and process more columns than a single page of B now contains.
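Outside MATLAB, the chunking idea can be sketched in plain Python (an illustration only: ThreadPoolExecutor stands in for the labs, and the toy triangular solver is not how mldivide works). The read-only factors are shared, each worker receives only its own slice of the columns, and the chunk results are interleaved back into column order afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

def tri_solve_cols(L, U, cols):
    """Forward + backward substitution for a chunk of columns, sharing read-only L, U."""
    n = len(L)
    out = []
    for b in cols:
        y = [0.0] * n
        for i in range(n):
            y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
        out.append(x)
    return out

L = [[2.0, 0.0], [1.0, 3.0]]
U = [[1.0, 4.0], [0.0, 2.0]]
cols = [[2.0, 4.0], [14.0, 6.0], [2.0, 4.0], [14.0, 6.0]]  # columns of B

# Split the columns into chunks, one per worker, like distributed(B) slicing
# B columnwise; the factors are shared, only the column slices differ.
chunks = [cols[0::2], cols[1::2]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda c: tri_solve_cols(L, U, c), chunks))

# Interleave the chunk results back into the original column order.
X = [None] * len(cols)
X[0::2], X[1::2] = results[0], results[1]
```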
Herwig on 18 Dec 2013
Hi Matt,
co-distribution did the trick! The only thing I needed to change is to split
T=M2\(M1\B);
into
T=M1\B;
T=M2\T;
or else codistribution fails.
And you are right: now I am memory-limited; co-distribution is very memory-hungry. It looks like every worker wants its own copy of M1, M2 and B. Is there something I could do about this (apart from buying more memory)? Anyway, answer accepted!
Thanks!
Matt J on 18 Dec 2013
Herwig,
I can't account for anything you've mentioned here, neither the need to split T=M2\(M1\B) into two statements, nor the need to broadcast an entire copy of B to each worker. The statement
B=distributed(B)
should have appropriately sliced B columnwise and sent small pieces to the labs. It is true that an entire copy of M1 and M2 needs to be broadcast to the workers, but you only have to do that once. In subsequent SPMD blocks, M1 and M2 should still be there, as long as you don't close the matlabpool.
You could try doing the lu decomposition on all the labs so as to avoid broadcasting copies:
spmd
[M1,M2,P,Q,R] = lu(A);
end
Herwig on 19 Dec 2013 (edited 19 Dec 2013)
Hi Matt,
I tried putting the LU decomposition inside the SPMD block. It works, but this way MATLAB needed about 30% more time AND memory... So I've gone back to the separate statements outlined in my previous comment. I don't understand why your MATLAB appears to behave better than mine.
Another thing I thought about is using memmapfile within the SPMD block just to reduce memory requirements, but that caused an error in MATLAB. I'm not sure whether SPMD can be used with memmapfile at all.
Matt J on 19 Dec 2013 (edited 19 Dec 2013)
I'm starting to think you are too memory-limited for what you are trying to do and should install more RAM. It's very cheap these days (or so I'm told).
Note that the Parallel Computing Toolbox is inherently memory-greedy and assumes you have lots of RAM to play with. It essentially works by launching the equivalent of several MATLAB sessions and broadcasting clones of all relevant data to those sessions (except of course for sliced or codistributed data).
I'm also still wondering whether a pre-computed decomposition is really the best course, given your memory limitations. You started with a sparse matrix A and turned it into factors M1 and M2 that are apparently a lot larger and more memory-consuming. Are you sure that the temporary memory consumption of A\B outweighs this? What is that memory spent on? What if you were to do
B=distributed(B);
spmd
T=A\B;
end
Herwig on 20 Dec 2013
Hi Matt,
I'm testing my code on a local machine with 16GB RAM, but can move to our cluster with 120GB RAM per node. So, RAM will be available. I just want to make sure that I can fit as many degrees of freedom into my model as is possible. That's why I'm testing a small problem on my local machine and try to optimize this first.
I tried your suggestion. Solving the problem directly via Gaussian elimination with A\B yields much worse performance in terms of memory and CPU time on my system. I think you need to give the explicit LU decomposition some more credit here ;) Gaussian elimination is not memory-efficient, but the Crout-type LU decomposition is! Also, remember that I want to solve the system more than once using different right-hand sides B_1, B_2, ...
Anyway, your suggestion to use SPMD and distributed() has already helped me a lot. Maybe this is all MATLAB can do at this point. Although I'm still curious to learn whether memmapfile and SPMD can be made to work together.
Matt J on 20 Dec 2013
Although I'm still curious to learn whether memmapfile and SPMD can be made to work together.
Not sure why they wouldn't, but you might try MATFILE instead of MEMMAPFILE.


More Answers (0)

Asked on 13 Dec 2013
Last commented on 20 Dec 2013
