How to vectorize nested for loops that compute pair wise correlations using indexed vectors?

6 visualizaciones (últimos 30 días)
Hi,
I would like to vectorize two for loops in the following function, which returns the indices of those rows that have pair wise correlations less than a threshold.
function idx_pairs = get_indices_pair_corr(data, idx, idx2, tau)
idx_pairs = [];
for i=1:length(idx)
for j=1:length(idx2)
if pearson_corr_coeff(data(idx(i),:), data(idx2(j),:)) < tau
idx_pairs = [idx_pairs; [idx(i) idx2(j)]];
end
end
end
  1. data is a (MxN) matrix
  2. idx, idx2 are vector of unequal/equal lengths
  3. tau is a scalar.
  4. pearson_corr_coeff(...) returns a scalar
If it is not possible to vectorize, please explain why so that I can learn more about the cases when vectorization is not possible.
thnaks

Respuestas (1)

Dana
Dana el 25 de Jun. de 2020
Editada: Dana el 26 de Jun. de 2020
There might be a better way, but here's how I might go about it.
function idx_pairs = get_indices_pair_corr(data, idx, idx2, tau)
% make sure vectors of indices are column vectors
idx = idx(:);
idx2 = idx2(:);
% lengths of indices
n = length(idx);
n2 = length(idx2);
% extract the rows of data corresponding to idx and then transpose so that
% columns correspond to different variables, and rows to observations
data_red1 = data(idx,:).';
% normalize data by subtracting the column mean, and then dividing by the
% column standard deviation; now each column has mean 0 and std 1
data_red1 = (data_red1 - mean(data_red1,1))./std(data_red1,1,1);
% do the same manipulations for data corresponding to idx2
data_red2 = data(idx2,:).';
data_red2 = (data_red2 - mean(data_red2,1))./std(data_red2,1,1);
% here we reshape data_red1 a 3D array, so that if originally data_red1
% was an NxM matrix, it's now an Nx1xM array
sz = size(data_red1);
reshdata_red1 = reshape(data_red1,[sz(1),1,sz(2)]);
% the (i,j,k) element of the 3D array X will be the product of
% data_red1(i,k) and data_red2(i,j)
X = data_red2.*reshdata_red1;
% now we compute correlation coefficients by taking the mean along the
% first dimension of those products, and then reshape the resulting 1xn2xn
% array to a n2xn matrix, the (j,k)-th element of which is the correlation
% coefficient between data_red1(:,k) and data_red2(:,j)
crs = reshape(mean(X,1),[n2,n]);
% locate correlations that are less than tau (NOTE: are you sure you don't
% want to check whether the ABSOLUTE VALUE of the correlation coefficient
% is less than tau? If so, replace crs with abs(crs) here.)
tst = (crs < tau); % logical matrix with (j,k) element 'true' if correlation
% between data_red1(:,k) and data_red2(:,j) is less than tau
[i,j] = find(tst); % get row/column indices associated with the true values
idx_pairs = [idx(j(:)),idx2(i(:))]; % convert to indices from the original data matrix
Depending on the sizes of the arrays you're using, this approach may be much faster than your loop version. For example, if data is 1000x1000, and idx and idx2 each have 100 elements in them, then your code (slight modified because I don't have this function pearson_corr_coeff) took about 1 second to run on my machine, whereas my code took about 0.02 seconds.
  8 comentarios
Wasim Aftab
Wasim Aftab el 7 de Jul. de 2020
Thanks Dana, it worked. I was testing your function inside a code which is used to process a large dataset. The code ran for 190406.349s but interestingly your function which was called 106030 times, took only 25.396s. very fast indeed!
However, the most time consming step (~97% of total time) has been the following step, which was called 92989 times. neg is a very large 3d matrix whose size is approximately (4.6million X 81 X 2).
neg = cat(1, neg, create_pairs(idx_pairs_bad_corr, data));
where, create_pairs(...) is defined as follows,
function pos = create_pairs(idx_pairs, data)
pos = zeros(size(idx_pairs, 1), size(data, 2), 2);
pos(:,:,1) = data(idx_pairs(:,1), :);
pos(:,:,2) = data(idx_pairs(:,2), :);
do you have any clue how to optimize that?
Dana
Dana el 7 de Jul. de 2020
Editada: Dana el 7 de Jul. de 2020
As far as I can tell, the most obvious direction to try to improve surrounds your array neg. It looks like this array is contained within a loop, and each time through the loop you're adding more rows to it. This is very slow. Basically, each time through the loop, MATLAB has to create a whole new array with the new dimensions of neg will be, copy the information in the old neg into the first bunch of rows, and then put the new information in the new rows. That's a lot of copying, and MATLAB is actually probably warning you in the code editor that this not a great idea (I'm guessing neg has a squiggly orange line underneath it, and if you hover your mouse over it tells you that it's changing size on each iteration, and that you should consider pre-allocating).
Depending on whether or not you know ahead of time roughly how many rows neg will end up having, I see two options that could potentially speed this up.
Option 1: # of rows known
If you know in advance how many rows neg will end up having, you can simply pre-allocate neg with the correct number. So for example, you can do something like the following:
neg = zeros(nrws,81,2); % pre-allocation, where nrws is the known # of rows
ctr = 0; % counter to track how many rows of neg have already been assigned
for j = 1:n % n here is the total number of loops
% < put code here to get idx_pairs_bad_corr and data for iteration j >
pos = creat_pairs(idx_pairs_bad_corr, data); % get pairs for iteration j
nnew = size(idx_pairs_bad_corr,1); % number of new rows to be added to neg
neg(ctr+1:ctr+nnew,:,:) = pos; % assign rows of neg for iteration j
ctr = ctr+nnew; % advance counter
end
Option 2: # of rows unknwon
If you don't know how many rows neg could have in general, you can do something like the following:
negcll = cell(n,1); % j-th entry of this cell array will hold the matrix pos for j-th iteration
nnew = zeros(n,1); % vector to hold number of new rows at j-th iteration
for j = 1:n % n here is the total number of loops
% < put code here to get idx_pairs_bad_corr and data for iteration j >
% put pairs for iteration j in j-th element of cell array negcll
negcll{j} = creat_pairs(idx_pairs_bad_corr, data);
nnew(j) = size(idx_pairs_bad_corr,1); % number of new rows at j-th iteration
end
neg = zeros(sum(nnew),81,2); % pre-allocate neg with correct total number of rows
ctr = 0; % counter to track how many rows of neg have already been assigned
for j = 1:n
neg(ctr+1:ctr+nnew(j),:,:) = negcll{j}; % assign rows of neg for iteration j
ctr = ctr+nnew(j); % advance counter
end
EIther option above should make a noticeable improvement in speed, though just how much depends on a number of factors. I'd be curious to hear how much of a difference it makes in your case.

Iniciar sesión para comentar.

Categorías

Más información sobre Logical en Help Center y File Exchange.

Etiquetas

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by