Fast vector reshaping/permutation
Mostrar comentarios más antiguos
I'm trying to optimize a very specific vector operation, namely taking a large (2^20 x 1) vector, reshaping it, permuting the indices, and reshaping once more. To be concrete, an example:
A = rand(2^20,1); % Large vector, with one dimension a power of 2
A = A./norm(A); % Normalize just for convenience
DR = 2^6; % DR,DL,DM are powers of 2 which multiply to form the size of A
DL = 2^6;
DM = 2^8;
tic;
B = reshape(permute(reshape(A,DR,DM,DL),[2,1,3]),DM,DR*DL);
toc; % On my machine this takes ~1.2 ms
The above operation is very simple, and entirely limited in speed by the permute step - as I understand it, permutation in matlab requires the entire array to be copied, losing time for the copy to be created and the transfer to occur. I am wondering if there is any clever way to get past this requirement for this specific use-case.
I have tried putting the operation of a gpu (by calling, for instance),
A = rand(2^20,1,'gpuArray')
Which does improve the runtime by a factor of ~4 but also hurts some other areas of my application. I have not yet tried to mexify the code, but would be interested if this seems a viable way to improve as well.
Edit from the comments: Ultimately this reshaped vector/matrix "B" is then multiplied by a Matrix (DM x DM), and then permuted/reshaped back into it's original form. If there is some fast way to combine all of those operations then that would of course be even more ideal.
Edit 2 for further context: As the answers/comments asked for more clarification of the overall use case, I will provide a toy model of a larger chunk of the code. Essentially this is the type of overall operation we are looking to do:
L = 20;
mid_size = 4;
DM = 2^mid_size;
A = rand(2^L,1);
A = A./norm(A);
Ms = rand(DM,DM,L-mid_size+1);
tic;
for left_size = 0:mid_size:(L-mid_size)
right_size = L - mid_size - left_size;
DR = 2^right_size;
DL = 2^left_size;
B = reshape(permute(reshape(A,DR,DM,DL),[2,1,3]),DM,DR*DL);
B_prime = Ms(:,:,left_size+1) * B;
A = permute(reshape(B_prime,DM,DR,DL),[2,1,3]);
end
A = reshape(A, 2^L, 1);
toc;
This is of course embedded in a larger program, but I think this is essentially an isolated "kernel"
5 comentarios
James Tursa
el 15 de Jun. de 2021
Editada: James Tursa
el 15 de Jun. de 2021
A mex routine could avoid the two reshape( ) function calls and could multi-thread the equivalent of the permute( ) operation. Whether this would run any faster than the MATLAB permute( ) function would depend on how the permute( ) function is coded and whether it is multi-threaded. The particular permute( ) operation you are doing is equivalent to a pagewise transpose, i.e. the pagetranspose( ) function. To get the mex routine to run as quickly as possible you would need to implement blocking code for the transpose to optimize use of the cache. Probably the only way to know if this beats the permute( ) or pagetranspose( ) function would be to write it and test it. First thing I would try is to replace your permute( ) with pagetranspose( ) and see if that makes any timing difference.
James Tursa
el 15 de Jun. de 2021
What are you doing with this result downstream in your code? Maybe you can avoid physically doing this operation altogether by adjusting how you do your operations downstream.
pagetranspose seems slower...
A = rand(2^24,1);
A = A./norm(A);
DR = 2^6;
DL = 2^6;
DM = 2^12;
tic;
B = reshape(permute(reshape(A,DR,DM,DL),[2,1,3]),DM,DR*DL);
toc;
tic
B = reshape(pagetranspose(reshape(A,DR,DM,DL)),DM,DR*DL);
toc
James Tursa
el 15 de Jun. de 2021
That's strange. I may have to experiment with some mex code to figure out what is going on with those permute( ) and pagetranspose( ) timings. I would have thought pagetranspose( ) would be optimized to be at least as fast as permute( ), but this is obviously not the case.
Adam Shaw
el 15 de Jun. de 2021
Respuesta aceptada
Más respuestas (2)
Matt J
el 15 de Jun. de 2021
1 voto
No, permute() will be the fastest way (on the CPU). How does the GPU hurt other areas of your application?
James Tursa
el 15 de Jun. de 2021
Editada: James Tursa
el 15 de Jun. de 2021
Don't do the permute( ) operation. Just use pagemtimes( ) downstream in your code with the appropriate 'transpose' option. This will cause the matrix multiply to use code that "virtually" transposes the matrix without actually physically forming it first.
https://www.mathworks.com/help/matlab/ref/pagemtimes.html?searchHighlight=pagemtimes&s_tid=srchtitle
E.g., something like this if I understand your dimensions:
result = reshape(pagemtimes(Matrix,'none',reshape(A,DR,DM,DL),'transpose'),DM,DR*DL);
I think pagemtimes( ) is multi-threaded and uses BLAS in the background so I doubt a mex routine could be written to beat this for speed.
10 comentarios
Probably better/simpler to do,
result = pagemtimes(A,Matrix.');
or on the GPU with pagefun,
result = pagefun(@mtimes, A,Matrix.');
James Tursa
el 15 de Jun. de 2021
1) The two reshapes still need to be there.
2) I am guessing pagemtimes( ) simply passes flags to the BLAS routines so that no physical transpose is done or needed.
1) I don't think the reshapes need to be there. @Adam Shaw has said that the final result is to end up in the shape of the original input.
2) Doesn't the matrix multiplication benefit from better memory contiguity if the matrix is actually in its transposed form?
James Tursa
el 16 de Jun. de 2021
Editada: James Tursa
el 16 de Jun. de 2021
1) I guess we haven't been told for sure the original dimensions of A, but I was assuming the reshape was necessary to get it into paged matrix form based on the example given. Otherwise how is pagemtimes supposed to work with it?
2) For a generic matrix multiply I would think B' * C might be optimal because that puts all of the dot product operand vectors in contiguous memory. But for this case with the transpose on the 2nd operand I frankly don't know if doing a physical transpose first would help or not. My original thought was that it wouldn't, because that puts both dot product operand vectors in non-contiguous memory. But I don't know the details of how the BLAS is coded, so the only way to really find out is to try it. That being said, your suggestion of calculating the tranposed result doesn't match the memory layout of the original calculation, so it seems you would need another permute( ) operation in there.
Just to clarify, my suggestion if A and B must be vectors is to do,
B=pagemtimes( reshape(A,DR,DM,DL), Matrix.'); B(:);
which should be equivalent to what the OP asked for.
However, I wonder if A and B really should/need to be maintained as vectors. I suspect the whole processing chain can be done if A and B are kept in the form of DRxDMxDL 3D arrays.
James Tursa
el 16 de Jun. de 2021
Editada: James Tursa
el 16 de Jun. de 2021
"which should be equivalent to what the OP asked for"
I don't think so. OP is essentially asking for an M * A' operation (I think?), and you are proposing an A * M' operation. Yes the individual element calculations are the same but the result puts the elements in transposed positions in memory.
"I wonder if A and B really should/need to be maintained as vectors"
... another good point.
I don't think so. OP is essentially asking for an M * A' operation (I think?),
@Adam Shaw mentioned above that after the multiplication is done, the result is to be permuted again and put back in its original shape:
"Ultimately this reshaped vector is then multiplied by a Matrix (DM x DM), and then permuted/reshaped back into it's original form."
Therefore, because of the 2nd permutation, the net effect of the operation (ignoring reshaping) is A*M.' on each page of A.
James Tursa
el 16 de Jun. de 2021
Well, I guess OP is going to have to weigh in on what he really wants. His original calculation only had one permute (for the 3D transpose operation) with the two reshapes going between vectors and 3D arrays.
Adam Shaw
el 16 de Jun. de 2021
Categorías
Más información sobre Matrix Indexing en Centro de ayuda y File Exchange.
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!