# Help to optimize the execution time of this function, modifying the entries of a matrix (GPU? Parallel CPU?).

1 view (last 30 days)
Thomas Barrett on 31 Jan 2021
Commented: Thomas Barrett on 9 Feb 2021
Hello,
I am looking for some support with how to increase the speed of a piece of code, and would be really grateful for some help.
****** EDIT : My own attempt at optimized code is given in the comment to the OP below ******
I have a complex matrix M and need to iterate over many steps, changing the entries of M on each iteration. The exact operations applied to M at each iteration depends on the previous values of M, as well as several complex vectors which, once defined, do not change. These are represented by a_vec, bvec, ... e_vec in the script below, which are all different. There is also a constant scalar A.
A minimum working example is given below:
%%%%%%%%%%%% Set up and initialisation %%%%%%%%%%%%
N_iter = 20; % Number of iterations
Ny = 500; % Dimensions of matrix M
Nx = 400; %
a_vec = rand(Nx,1) + rand(Nx, 1)*1i; % Predefine some vectors
b_vec = rand(Nx,1) + rand(Nx, 1)*1i; %
c_vec = rand(Nx,1) + rand(Nx, 1)*1i; %
d_vec = rand(Nx,1) + rand(Nx, 1)*1i; %
e_vec = rand(Nx,1) + rand(Nx, 1)*1i; %
A = rand(1) + rand(1)*1i; % A scalar constant
M = rand(Ny, Nx) + rand(Ny, Nx)*1i; % Initialise the matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%% Main loop %%%%%%%%%%%%%%%%%%%%
tic
for k = 1:N_iter
% ...
M = modify_matrix(M, Nx, Ny, A, a_vec, b_vec, c_vec, d_vec, e_vec);
% Do more processing on new M ...
% ...
end
t = toc;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%% Function to modify matrix M %%%%%%%%%%%
function M = modify_matrix(M, Nx, Ny, A, a_vec, b_vec, c_vec, d_vec, e_vec)
for m = 2:1:(Ny-1) % Apply to every row of M (excluding boundaries)
f_vec = zeros(Nx,1);
f_vec(2) = 0; % Initial condition
for n = 2:1:(Nx-1) % Forward sweep (to determine f_vec entries)
B = A*M(m,n) + a_vec(n)*M(m,n+1) + b_vec(n)*M(m,n-1);
f_vec(n+1) = c_vec(n)*B + d_vec(n)*f_vec(n);
end
for n = Nx:-1:2 % Backward sweep (without modifying boundary values)
M(m,n-1) = e_vec(n)*M(m,n) + f_vec(n); % Update the M entries
end
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
I am looking for some support on the best way to speed up the function modify_matrix(), which is too slow as I have coded it. Ultimately, the number of iterations will be around 100,000.
I have seen the function parfor , which I feel may be used on the outer loop (over m). My CPU is Intel i7-6700, which has up to 8 threads, so I guess the best improvement would be max x8 using this approach.
Also on my machine is an Nvidia Quadro P4000 GPU, which has 1792 CUDA cores. Is there a way I can make use of this hardware to get much more than x8 performance on this function? I have seen the functions arrayfun(), pagefun(), and bsxfun(), but not sure how these might be used, or whether it even makes sense to consider them in my application. I have tried to simply define everything on the GPU at the beginning, using gpuArray(M),gpuArray(a_vec), ... etc , so that the processing is done on the GPU, but this seems to dramatically increase the execution time.
Any support on the best way to proceed would be great.
Thomas Barrett on 9 Feb 2021
I have attempted to make some improvements following the advice of @Riccardo Bonomelli. I have now created a new function update_row() which is placed inside a wrapper function modify_matrix() , whose job it is to loop over all rows. The rationale for doing this was so that this could easily be used with parfor. The new code is shown below:
%%%%%%%%%%%% Set up and initialisation %%%%%%%%%%%%
N_iter = 100; % Number of iterations
Ny = 500; % Dimensions of matrix M
Nx = 400; %
a_vec = rand(1,Nx) + rand(1, Nx)*1i; % Predefine some vectors
b_vec = rand(1,Nx) + rand(1, Nx)*1i; %
c_vec = rand(1,Nx) + rand(1, Nx)*1i; %
d_vec = rand(1,Nx) + rand(1, Nx)*1i; %
e_vec = rand(1,Nx) + rand(1, Nx)*1i; %
A = rand(1) + rand(1)*1i; % A scalar constant
M = rand(Ny, Nx) + rand(Ny, Nx)*1i; % Initialise the matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%% Main loop %%%%%%%%%%%%%%%%%%%%
tic
for k = 1:N_iter
% ...
M = modify_matrix_mex(M, Nx, Ny, A, a_vec, b_vec, c_vec, d_vec, e_vec);
% Do more processing on new M ...
% ...
end
t = toc;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
and the functions are as follows:
%%%%%%%%%%%%% Function to modify matrix M %%%%%%%%%%%
function M = modify_matrix(M, Nx, Ny, A, a_vec, b_vec, c_vec, d_vec, e_vec)
parfor m = 2:(Ny-1) % Loop over axial values
M(m,:) = update_row(M(m,:), Nx, Ny, A, a_vec, b_vec, c_vec, d_vec, e_vec);
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%% Function to modify a row of matrix M %%%%%%
function M_row = update_row(M_row, Nx, Ny, A, a_vec, b_vec, c_vec, d_vec, e_vec)
f_vec = complex(zeros(1,Nx));
B = complex(zeros(1,Nx));
B(2:Nx-1) = A*M_row(2:Nx-1) + a_vec(2:Nx-1).*M_row(3:Nx) + b_vec(2:Nx-1).*M_row(1:Nx-2);
for n = 2:(Nx-1) % Forward sweep
f_vec(n+1) = c_vec(n)*B(n) + d_vec(n)*f_vec(n);
end
for n = Nx:-1:2 % Backward sweep
M_row(n-1) = e_vec(n)*M_row(n) + f_vec(n); % Update the M entries
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
(In the last function I have moved the creation of B outside the loop, because I saw it can be vectorized. I don't believe the remaining loops can be removed though).
Running the original code for Nt = 100 iterations originally took 5.7 seconds on my machine. Converting modify_matrix() to a MEX function dropped this to 0.75 seconds. (I also tried making update_row() into a MEX function but since it is called from inside the other MEX function I wasn't sure how to compile them in a nested way).
Finally adding the parfor loop reduced again further to 0.43 seconds, on my quad core machine. So a total speedup of x13.
I would like to run this code for Nt = several tens / hundreds of thousands of iterations, not just 100 now, so need a better speedup. Can anyone provide support with how to achieve this?
Many thanks,
Tom

Riccardo Bonomelli on 31 Jan 2021
Hi Thomas, the easiest procedure to implement to speed up the execution is to generate a MEX file of your function using the Matlab Coder app. By doing this Matlab will create a C or C++ version of your function which should run faster. The process to do this is very simple, Matlab will guide step by step, if you have troubles you can refer to the documentation or the tutorials available online. Now the code you uploaded runs in about 2 seconds on my pc, but if I switch to the MEX version of the function the same code runs in about 0.18 seconds.
Other ideas like parallelization or GPU execution are viable (and probably even better) options to speed up the execution but require major changes in the code to reach a good level of performance, honestly I'm not an expert in this subject so my opinion is not completely accurate.
Thomas Barrett on 9 Feb 2021
I see, thanks. I was assuming it would go roughly linearly if the loop was well parallelizable.
@Walter Roberson could you offer any support as to how I can speed up my code futher? The current state of affairs for me is the code given in the comment to the OP above. I have converted to MEX, and added a parfor loop, giving me a x13 speedup so far.
Is there a way that I can use any other tricks (arrayfun()?), or make use of my GPU to get this at least another order of magnitude faster?

R2020b

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by