How to speedup the following

Question

TheStranger el 13 de Dic. de 2021

1
Enlazar

Enlace directo a esta pregunta

https://la.mathworks.com/matlabcentral/answers/1609835-how-to-speedup-the-following

Editada: Jan el 15 de Dic. de 2021

Hi there!

I have a problem: I am trying to speedup the calculation of some matrix (2D array).

Consider a rank-3 tensor of size NxNxN (3 dimensional array, if you wish) called A, and a vector of size N called D, and then I construct the two-dimensional array S in the following way:

A = rand(N, N, N);

D = rand(N, 1);

S = zeros(N);

for i = 1:N

for j = 1:N

S = S + A(:, :, i).*A(:, :, j)/(D(i) + D(j));

end

And afterwards, I solve the linear system involving S.

How to speed-up this?

A naive idea is to get rid of the explicit for loop over "j", and replace it with a summation. For instance, I can do the following:

A = rand(N, N, N);

D = rand(N, 1);

S = zeros(N);

Dj = repmat(D, 1, N, N);

Dj = permute(Dj, [3, 2, 1]);

for i = 1:N

S = S + sum(repmat(A(:, :, i), 1, 1, N).*A./(D(i) + Dj), 3)

end

Seems like a right way to go, in the spirit of vectorization, we can get rid of the internal second for loop.

However, it turns out the first version is faster for sufficiently large N than the second one (please, see the picture below). I understand that it is, probably, due to repmat, and sum(..., 3), but is there any way to speed-up this?

Thank you in advance!

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Jan el 13 de Dic. de 2021

1
Enlazar

Enlace directo a esta respuesta

https://la.mathworks.com/matlabcentral/answers/1609835-how-to-speedup-the-following#answer_854050

Editada: Jan el 15 de Dic. de 2021

Abrir en MATLAB Online

N = 200;
A = rand(N, N, N);
D = rand(N, 1);
tic  % Original
S = zeros(N);
for i = 1:N
   for j = 1:N
      S = S + A(:, :, i) .* A(:, :, j) / (D(i) + D(j));
   end
end
toc
tic  % Version 1
S = zeros(N);
Dj = repmat(D, 1, N, N);
Dj = permute(Dj, [3, 2, 1]);
for i = 1:N
    S = S + sum(repmat(A(:, :, i), 1, 1, N) .* A ./ (D(i) + Dj), 3);
end
toc
tic  % Version 2
S = zeros(N);
Dj = repmat(D, 1, N, N);
Dj = permute(Dj, [3, 2, 1]);
for i = 1:N  % No need for REPMAT:
    S = S + sum(A(:, :, i) .* A ./ (D(i) + Dj), 3);
end
toc
tic  % Version 3
S = zeros(N);
for i = 1:N
   Ai = A(:, :, i);
   Di = D(i);
   for j = 1:N
      S = S + Ai .* A(:, :, j) / (Di + D(j));
   end
end
toc
tic  % Version 4:
S = zeros(N);
T = zeros(N);
for i = 1:N
   Ai = A(:, :, i);
   Di = D(i);
   T(:) = 0;
   for j = 1:N
      T = T + A(:, :, j) ./ (Di + D(j));
   end
   S = S + Ai .* T;
end
toc
tic  % Version 5a
S  = zeros(N);
AC = num2cell(A, [1,2]);
for i = 1:N
   Di = D(i);
   for j = 1:N
      S = S + AC{i} .* AC{j} / (Di + D(j));
   end
end
toc
tic  % Version 5b
S  = zeros(N);
% Replace: AC = num2cell(A, [1,2]); by inlined (tiny advantage):
AC = cell(1, N);
for k = 1:N
   AC{k} = A(:,:,k);
end
for i = 1:N
   Di = D(i);
   Ai = AC{i};
   for j = 1:N
      S = S + Ai .* AC{j} / (Di + D(j));
   end
end
toc
% R2018b, Mobile i5, 16 GB RAM, N=200:
Elapsed time is 6.507656 seconds.  % Original
Elapsed time is 6.926377 seconds.  % Vectorized
Elapsed time is 4.615969 seconds.  % Loops with copy of matrix
Elapsed time is 4.779539 seconds.  % Why is this not faster?!
Elapsed time is 2.597977 seconds.  % Split once by num2cell
Elapsed time is 2.450893 seconds.  % Split with a loop

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Jan el 14 de Dic. de 2021

Editada: Jan el 14 de Dic. de 2021

Abrir en MATLAB Online

In the original version and version 1, A(:, :, i) is copied N times. This is avoided in version 3 and 4.

In version 3, all terms are multiplied

   for j = 1:N
      S = S + Ai .* A(:, :, j) / (Di + D(j));
   end   

This is equivalent to:

T = Ai .* A(:,:,1) / d + Ai .* A(:,:,2) / d + Ai .* A(:,:,3) / d + ...
S = S + T;

I've tried to reduce the number of multiplications in version 4:

T = A(:,:,1) / d + A(:,:,2) / d + A(:,:,3) / d + ...
S = S + Ai .* T

This saves N elemenwise multiplications and this should be remarkably faster, but it isn't. I guess that Matlab's JIT acceleration does some magic internally in version 3 already.

I'm still not really impressed.

A full vectorization would require the same amount of multiplications and memory accesses, but a huge intermediate array, which does not match into the CPU cache. Performing the sum in chunks is more efficient.

I've added 2 further version 5a and 5b. Here in the forum they are 4 times faster than the original version, but the speed differs massively in a local version. By the way, it differs also if I run the code in the command windows or from inside a function.

Jan el 15 de Dic. de 2021

Editada: Jan el 15 de Dic. de 2021

Matlab's execution speed is increase by the "JIT", the just-in-time accelerator. It can simplify expressions and store data pointers to avoid the overhead of parsing the header of a Matlab variable. The JIT is not documented and there are a lot of changes between the Matlab version. It matters, if some code appears in a function, or a script or if you run it in the command window. If a function contains an eval to create a variable dynamically or a load without catching the output, the JIT is disabled and the run-time can grow by a factor of 100.

An exhaustive book about efficient Matlab methods is Yair's "Undocumented Matlab", see http://UndocumentedMatlab.com .

Yes, 5a abd 5b are almost identical and run with the same speed on my R2018b.

A(:, :, j) is not cheap, because it creates a new NxN matrix in the memory are copies N*N doubles. Doing this once in num2cell instead of N*N times is an advantage.

I've modified 5b in my code: Now num2cell() is inlined to avoid some overhead.

TheStranger el 15 de Dic. de 2021

Editada: TheStranger el 15 de Dic. de 2021

Thanks! There is so much to learn for me in the subject of how to speed up the code :).

For instance, I do not get the following:

"A(:, :, j) is not cheap, because it creates a new NxN matrix in the memory are copies N*N doubles. Doing this once in num2cell instead of N*N times is an advantage."

Like why does MATLAB has to copy the 2D subarray of A in order to perform element-wise multiplication? As I am not changing A itself, I would have naively expected it not to create a new local variable or is it some sort of an internal thing of the ".*" operation?

Is it also the case for 2D A? Like if I write A(:,i).*A(:,j)? Will it also copy the "i"th, and 'j'th columns?

Jan el 15 de Dic. de 2021

Editada: Jan el 15 de Dic. de 2021

Abrir en MATLAB Online

Maybe Matlab's JIT acceleration can handle

A = rand(10, 10);
A(:, i) .* A(:, j)

without creating the two arrays A(:,i), A(:,j) explicitly. We do not know this, becuse the JIT is not documented. The MathWorks explains: "If we document the JIT, the users adjust their codes to the JIT, but we want to adjusat the JIT to the code of the users." Optimizing from two ends would cause confusions.

If A(:,j) is created explicitly, a new Matlab variable is created. This means a header of about 100 bytes containing, the class and dimensions, the number of chared copies and a pointer to the actual data as well as the actual data. This is the reason why A(:,:,j) needs time, while AC{j} is a cheap access of an existing variable.

James Tursa has implemented a smart function to share data with re-using parts of the original data: https://www.mathworks.com/matlabcentral/fileexchange/65842-sharedchild-creates-a-shared-data-copy-of-contiguous-subset . But even here the overhead for creating a Matlab variable is required.

In A C-Mex function you could access the data by using pointers and there is no overhead for creating a temporary variable. If this piece of code is the bottleneck of the complete code, it might be useful to create a Mex version.

Iniciar sesión para comentar.

How to speedup the following

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Productos

Community Treasure Hunt

How to speedup the following

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuesta aceptada

7 comentarios Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos

Más respuestas (0)

Ver también

Categorías

Etiquetas

Productos

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

7 comentarios
Mostrar 5 comentarios más antiguosOcultar 5 comentarios más antiguos