How to vectorize nested for loops that compute pair wise correlations using indexed vectors?

Question

Wasim Aftab el 25 de Jun. de 2020

0
Enlazar

Enlace directo a esta pregunta

https://la.mathworks.com/matlabcentral/answers/554692-how-to-vectorize-nested-for-loops-that-compute-pair-wise-correlations-using-indexed-vectors

Editada: Dana el 7 de Jul. de 2020

Hi,

I would like to vectorize two for loops in the following function, which returns the indices of those rows that have pair wise correlations less than a threshold.

function idx_pairs = get_indices_pair_corr(data, idx, idx2, tau)
idx_pairs = [];
for i=1:length(idx)
    for j=1:length(idx2)
        if pearson_corr_coeff(data(idx(i),:), data(idx2(j),:)) < tau
            idx_pairs = [idx_pairs; [idx(i) idx2(j)]];
        end
    end
end

data is a (MxN) matrix
idx, idx2 are vector of unequal/equal lengths
tau is a scalar.
pearson_corr_coeff(...) returns a scalar

If it is not possible to vectorize, please explain why so that I can learn more about the cases when vectorization is not possible.

thnaks

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Answer 1

Dana el 25 de Jun. de 2020

0
Enlazar

Enlace directo a esta respuesta

https://la.mathworks.com/matlabcentral/answers/554692-how-to-vectorize-nested-for-loops-that-compute-pair-wise-correlations-using-indexed-vectors#answer_457153

Editada: Dana el 26 de Jun. de 2020

Abrir en MATLAB Online

There might be a better way, but here's how I might go about it.

function idx_pairs = get_indices_pair_corr(data, idx, idx2, tau)
% make sure vectors of indices are column vectors
idx = idx(:);
idx2 = idx2(:);
% lengths of indices
n = length(idx);
n2 = length(idx2);
% extract the rows of data corresponding to idx and then transpose so that
% columns correspond to different variables, and rows to observations
data_red1 = data(idx,:).';
% normalize data by subtracting the column mean, and then dividing by the
% column standard deviation; now each column has mean 0 and std 1
data_red1 = (data_red1 - mean(data_red1,1))./std(data_red1,1,1);
% do the same manipulations for data corresponding to idx2
data_red2 = data(idx2,:).';
data_red2 = (data_red2 - mean(data_red2,1))./std(data_red2,1,1);
% here we reshape data_red1 a 3D array, so that if originally data_red1
% was an NxM matrix, it's now an Nx1xM array
sz = size(data_red1);
reshdata_red1 = reshape(data_red1,[sz(1),1,sz(2)]);
% the (i,j,k) element of the 3D array X will be the product of
% data_red1(i,k) and data_red2(i,j)
X = data_red2.*reshdata_red1;   
% now we compute correlation coefficients by taking the mean along the
% first dimension of those products, and then reshape the resulting 1xn2xn
% array to a n2xn matrix, the (j,k)-th element of which is the correlation
% coefficient between data_red1(:,k) and data_red2(:,j)
crs = reshape(mean(X,1),[n2,n]);
% locate correlations that are less than tau (NOTE: are you sure you don't
% want to check whether the ABSOLUTE VALUE of the correlation coefficient
% is less than tau? If so, replace crs with abs(crs) here.)
tst = (crs < tau);  % logical matrix with (j,k) element 'true' if correlation
                    % between data_red1(:,k) and data_red2(:,j) is less than tau
[i,j] = find(tst);  % get row/column indices associated with the true values
idx_pairs = [idx(j(:)),idx2(i(:))];   % convert to indices from the original data matrix

Depending on the sizes of the arrays you're using, this approach may be much faster than your loop version. For example, if data is 1000x1000, and idx and idx2 each have 100 elements in them, then your code (slight modified because I don't have this function pearson_corr_coeff) took about 1 second to run on my machine, whereas my code took about 0.02 seconds.

8 comentarios
Mostrar 6 comentarios más antiguosOcultar 6 comentarios más antiguos

Wasim Aftab el 26 de Jun. de 2020

Editada: Wasim Aftab el 26 de Jun. de 2020

Abrir en MATLAB Online

Hi Dana

Thanks for your time.

However, there is a logical error in the step

[i,j] = find(tst);

Because the indices are relative to the row numbers of tst matrix but the goal is to find the same w.r.t. row numbers of data matrix.

I think one way to solve that would be to create a matrix M of dimension [length(idx) by length(idx2)] where each cell holds the pairs generated using idx and idx2 in (i,j) form. Then put zeros to all the cells in M that are false in the tst matrix, because tst and M marices are proxies to each other in terms of relative ordering of rows w.r.t. data matrix.

tst = (crs < tau);

Finally, pick up the non zero cells from tst and create idx_pairs matrix and return.

but it is complicated alorithm, looking for something simpler.

Please find the pearson_corr_coeff(..) below.

function r = pearson_corr_coeff(X, Y)
format longg
if ~iscolumn(X) %force into column vec
    X=X';
end
if ~iscolumn(Y) %force into column vec
    Y=Y';
end  
r=((X-mean(X))'*(Y-mean(Y)))/(sqrt(sum((X-mean(X)).^2)*sum((Y-mean(Y)).^2)));
end

Dana el 26 de Jun. de 2020

Abrir en MATLAB Online

You're right that the indices found in [i,j] = find(tst) are relative to the tst matrix, not the data matrix. However, this is corrected at the next step:

idx_pairs = [idx(j),idx2(i)];

So if tst(i,j)=true, then it means the correlation between data_red1(:,k) and data_red2(:,j) is less than tau, which in turn means row idx(j) and row idx2(i) of data have correlation less than tau, and therefore we want to include the pair [idx(j),idx2(i)] in idx_pairs.

Now, playing around with it, it turns out there are potentially some minor discrepancies between the results of my code and yours. However, this is due to a minor error in my computation of the correlation coefficient. The code I posted before incorrectly divides by the unbiased estimate of the standard deviation,

rather than the biased one,

If you replace std(data_red1,0,1) and std(data_red2,0,1) where they appear by std(data_red1,1,1) and std(data_red2,1,1), respectively, you should now find that my code and your code produce exactly the same answers..

Wasim Aftab el 7 de Jul. de 2020

Abrir en MATLAB Online

Thanks Dana, it worked. I was testing your function inside a code which is used to process a large dataset. The code ran for 190406.349s but interestingly your function which was called 106030 times, took only 25.396s. very fast indeed!

However, the most time consming step (~97% of total time) has been the following step, which was called 92989 times. neg is a very large 3d matrix whose size is approximately (4.6million X 81 X 2).

neg = cat(1, neg, create_pairs(idx_pairs_bad_corr, data));

where, create_pairs(...) is defined as follows,

function pos = create_pairs(idx_pairs, data)
pos = zeros(size(idx_pairs, 1), size(data, 2), 2);
pos(:,:,1) = data(idx_pairs(:,1), :);
pos(:,:,2) = data(idx_pairs(:,2), :);

do you have any clue how to optimize that?

Dana el 7 de Jul. de 2020

Editada: Dana el 7 de Jul. de 2020

Abrir en MATLAB Online

As far as I can tell, the most obvious direction to try to improve surrounds your array neg. It looks like this array is contained within a loop, and each time through the loop you're adding more rows to it. This is very slow. Basically, each time through the loop, MATLAB has to create a whole new array with the new dimensions of neg will be, copy the information in the old neg into the first bunch of rows, and then put the new information in the new rows. That's a lot of copying, and MATLAB is actually probably warning you in the code editor that this not a great idea (I'm guessing neg has a squiggly orange line underneath it, and if you hover your mouse over it tells you that it's changing size on each iteration, and that you should consider pre-allocating).

Depending on whether or not you know ahead of time roughly how many rows neg will end up having, I see two options that could potentially speed this up.

Option 1: # of rows known

If you know in advance how many rows neg will end up having, you can simply pre-allocate neg with the correct number. So for example, you can do something like the following:

neg = zeros(nrws,81,2); % pre-allocation, where nrws is the known # of rows
ctr = 0; % counter to track how many rows of neg have already been assigned
for j = 1:n % n here is the total number of loops
    
    % < put code here to get idx_pairs_bad_corr and data for iteration j >
    
    pos = creat_pairs(idx_pairs_bad_corr, data);    % get pairs for iteration j
    nnew = size(idx_pairs_bad_corr,1); % number of new rows to be added to neg
    neg(ctr+1:ctr+nnew,:,:) = pos;  % assign rows of neg for iteration j
    ctr = ctr+nnew; % advance counter
end

Option 2: # of rows unknwon

If you don't know how many rows neg could have in general, you can do something like the following:

negcll = cell(n,1); % j-th entry of this cell array will hold the matrix pos for j-th iteration
nnew = zeros(n,1); % vector to hold number of new rows at j-th iteration
for j = 1:n % n here is the total number of loops
    
    % < put code here to get idx_pairs_bad_corr and data for iteration j >
    
    % put pairs for iteration j in j-th element of cell array negcll
    negcll{j} = creat_pairs(idx_pairs_bad_corr, data);
    nnew(j) = size(idx_pairs_bad_corr,1); % number of new rows at j-th iteration
end
neg = zeros(sum(nnew),81,2); % pre-allocate neg with correct total number of rows
ctr = 0; % counter to track how many rows of neg have already been assigned
for j = 1:n
    neg(ctr+1:ctr+nnew(j),:,:) = negcll{j};  % assign rows of neg for iteration j
    ctr = ctr+nnew(j); % advance counter
end

EIther option above should make a noticeable improvement in speed, though just how much depends on a number of factors. I'd be curious to hear how much of a difference it makes in your case.

Iniciar sesión para comentar.

How to vectorize nested for loops that compute pair wise correlations using indexed vectors?

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

8 comentarios
Mostrar 6 comentarios más antiguosOcultar 6 comentarios más antiguos

Ver también

Categorías

Etiquetas

Community Treasure Hunt

How to vectorize nested for loops that compute pair wise correlations using indexed vectors?

0 comentarios Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

Respuestas (1)

8 comentarios Mostrar 6 comentarios más antiguosOcultar 6 comentarios más antiguos

Ver también

Categorías

Etiquetas

Community Treasure Hunt

0 comentarios
Mostrar -2 comentarios más antiguosOcultar -2 comentarios más antiguos

8 comentarios
Mostrar 6 comentarios más antiguosOcultar 6 comentarios más antiguos