Lagged correlation, bias, and missing values

Question

Samuel diabate on 4 Jul 2022


xcorrnan.m

In short

I am looking to compute lagged correlation. I wonder why the following code:

t=0:999;              % common time axis
MAXLAG=999;           % largest lag, in samples
x=randn([1000,1]);
y=randn([1000,1]);
[C1,lags]=xcorr(x,y,MAXLAG,'coeff'); % Lagged correlation using xcorr

does not produce the same lagged correlations as the following code:

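t2=min(t)-MAXLAG:max(t)+MAXLAG;   % extended time axis, MAXLAG extra steps on each side
[~,i1,i2]=intersect(t,t2);        % positions of the original times within t2
X=NaN(size(t2));
X(i2)=x(i1);                      % x on the extended axis, NaN elsewhere
Y=NaN(size(t2));
Y(i2)=y(i1);                      % y on the extended axis, NaN elsewhere
YY=NaN([2*MAXLAG+1,size(t2,2)]);  % one lagged copy of Y per row
for i=1:2*MAXLAG+1
    k=lags(i);
    YY(i,:) = circshift(Y,k,2);   % shift along the row; only NaN padding wraps around
end
C2=corr(X',YY','rows','pairwise'); % Lagged correlation using NaNs, circshift and corr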

This can be checked with a simple plot. The two vectors C1 and C2 diverge quite dramatically at large lags.

figure
plot(lags,C2,lags,C1)

Why is this? And how can the second snippet be changed to behave like xcorr?
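One way to look at the discrepancy is to write both quantities as explicit sums. The sketch below uses base MATLAB only and assumes the documented definition of the 'coeff' option (raw cross-correlation divided by the square root of the product of the two zero-lag autocorrelations). The names cCoeff, rOverlap and nPairs are introduced here for illustration and are not part of the original code.

```matlab
% Sketch: xcorr-style 'coeff' versus an overlap-only (corr-like) correlation
rng(0);                          % fixed seed, for repeatability only
N = 1000;
x = randn(N,1);
y = randn(N,1);
maxlag = N-1;
lags = -maxlag:maxlag;

cCoeff   = NaN(size(lags));      % normalised by the energy of the full series
rOverlap = NaN(size(lags));      % centred and normalised on the overlap only
nPairs   = zeros(size(lags));    % number of x-y pairs available at each lag

for i = 1:numel(lags)
    m  = lags(i);
    n  = (max(1,1-m):min(N,N-m))';   % indices n where both y(n) and x(n+m) exist
    xs = x(n+m);
    ys = y(n);
    nPairs(i) = numel(n);
    % Zero padding adds nothing to the numerator, but the denominator never shrinks
    cCoeff(i) = sum(xs.*ys) / sqrt(sum(x.^2)*sum(y.^2));
    if nPairs(i) > 1
        xc = xs - mean(xs);          % corr removes the mean of the paired samples
        yc = ys - mean(ys);
        rOverlap(i) = sum(xc.*yc) / sqrt(sum(xc.^2)*sum(yc.^2));
    end
end

figure
plot(lags,rOverlap,lags,cCoeff)
legend('overlap-only (corr-like)','full-energy (xcorr-like)')
```

If the Signal Processing Toolbox is available, cCoeff(:) can be compared against xcorr(x,y,maxlag,'coeff'); under the assumption above the two should agree to rounding error, while rOverlap follows the noisier corr-based curve at large lags where few pairs remain.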

Lengthy details for the bravest

I am trying to write a function that computes lagged correlations between two timeseries x and y spanning different periods. The goal is a function that works in each of the following cases:

The two timeseries do not share any timesteps in common,
One of the two timeseries encompasses the other,
The two timeseries share timesteps in common without one encompassing the other.

The built-in MATLAB function xcorr works only in the last two cases, and only on the timespan where both x and y coexist. With a bit of fiddling, it is possible to apply xcorr in the first case too. But in all cases, lagged correlations obtained from xcorr will be either 1) limited in length by the shorter of the two timeseries or 2) incorrect, because of substantial zero-padding. It is easy to think of situations where this is unsatisfactory, in particular when one of the two timeseries is much longer than the other.
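For reference, the scaling options of xcorr can be written out as follows. This is a sketch of the estimators as described in the xcorr documentation, restricted to real-valued series of common length N (the shorter input being zero-padded); note that none of them subtracts the mean.

```latex
% Real-valued x, y of common length N (the shorter input is zero-padded)
\hat{R}_{xy}(m) =
\begin{cases}
  \sum_{n=1}^{N-m} x_{n+m}\, y_{n}, & m \ge 0,\\[4pt]
  \hat{R}_{yx}(-m), & m < 0,
\end{cases}
\qquad
\begin{aligned}
  \text{biased:}   &\quad \hat{R}_{xy}(m) / N,\\
  \text{unbiased:} &\quad \hat{R}_{xy}(m) / (N - |m|),\\
  \text{coeff:}    &\quad \hat{R}_{xy}(m) \big/ \sqrt{\hat{R}_{xx}(0)\,\hat{R}_{yy}(0)}.
\end{aligned}
```

At lag m the raw sum contains only N - |m| non-zero products, which is where the zero-padding enters.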

A simple workaround is to express x and y on a common time axis, fill the ends with NaNs, and compute lagged correlations between the gappy timeseries. Yet xcorr cannot compute lagged correlations between timeseries with missing elements. I know 'there is no simple solution to this problem' (https://uk.mathworks.com/matlabcentral/answers/232636-cross-correlation-using-xcorr-in-the-presence-of-nan-or-missing-values?s_tid=srchtitle), but I am trying hard to find one.

My solution so far is to compute correlations using corr(...,'rows','pairwise') between the input variable x and a matrix Y made up of lagged versions of the input variable y. The function corr accepts NaNs. This is all done in the attached script (xcorrnan.m), if you want to have a look. Here is an example of what the matrix Y looks like. At lag == 0, we have Y(lag==0)=y. All white spaces in the top-left and bottom-right corners are NaNs.
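As a toy illustration of such a lagged matrix, here is a minimal sketch on a hypothetical 3-sample series padded with maxlag NaNs on each side. The variable names are illustrative, and the orientation (one lag per row) and sign convention may differ from those used in xcorrnan.m.

```matlab
% Toy lagged matrix: hypothetical 3-sample series, maxlag = 2
y      = [1 2 3];
maxlag = 2;
Ypad   = [NaN(1,maxlag), y, NaN(1,maxlag)];    % NaN padding on both ends
lags   = -maxlag:maxlag;

YY = NaN(numel(lags), numel(Ypad));            % one lag per row
for i = 1:numel(lags)
    YY(i,:) = circshift(Ypad, lags(i), 2);     % shift along the time dimension
end
disp(YY)
```

Because the padding is at least maxlag samples wide, only NaNs wrap around in circshift, so every row still holds the three original values in order.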

I thought this would work, but it does not. I compared xcorrnan to xcorr in the simple case where x and y share the same timespan and there is no missing data. The following code computes lagged correlations between two almost identical sinusoids and plots a figure showing the difference in behaviour between my function xcorrnan and MATLAB's built-in xcorr.

Fs=4; % Sampling frequency (samples per year)
f=1/50; % Frequency of the signal generated below (cycles per year)
t=(1800:1/Fs:1999)';
x=3*cos(2*pi*f*t)+randn(size(t))/10;
y=2*cos(2*pi*f*t)+randn(size(t))/10;
figure; plot(t,x,t,y); grid on; box on; xlabel('Year');
legend('x','y')
maxlag=200; % Maximum lag, in years
C1 = xcorrnan(t,x,t,y,[-maxlag:1/Fs:maxlag],0,1/Fs,'N'); % Attached function
C2 = xcorr(x,y,maxlag*Fs,'coeff'); % Maximum lag converted to samples
figure
plot([-maxlag:1/Fs:maxlag],C1,'LineWidth',2)
hold on
plot([-maxlag:1/Fs:maxlag],C2,'LineWidth',2)
legend('xcorrnan','xcorr')
ylim([-1 1])

xcorrnan appears to overestimate absolute correlations at every lag other than 0. There are several things I do not understand:

Why do xcorr-obtained correlations tend towards 0 for greater absolute lags? I expect it has to do with bias correction here, but I don't really know. Otherwise it could be the influence of the spectrum of the rectangular window.
Should xcorr-obtained correlations be considered 'better'? I understand the results of xcorrnan. Correlations should periodically return to 1 since the input signals are almost periodic. Also, this figure from Wikipedia backs up the output of xcorrnan: https://commons.wikimedia.org/wiki/File:Cross_correlation_animation.gif#/media/File:Cross_correlation_animation.gif
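The taper towards 0 can be made explicit by factoring the 'coeff' estimate into an overlap-only correlation times an energy ratio. The following is a sketch under stated assumptions: real-valued series of equal length N without gaps, the documented 'coeff' normalisation, and an uncentred overlap correlation r_overlap (corr additionally subtracts the mean of the paired samples, so the match with corr is only approximate unless the overlap means are close to zero). The final approximation further assumes that the power of each signal is roughly constant in time.

```latex
% I_m: indices n for which both x_{n+m} and y_n exist; |I_m| = N - |m| without gaps
c_{\mathrm{coeff}}(m)
  = \frac{\sum_{n \in I_m} x_{n+m}\, y_n}
         {\sqrt{\sum_{n=1}^{N} x_n^2 \,\sum_{n=1}^{N} y_n^2}}
  = \underbrace{\frac{\sum_{n \in I_m} x_{n+m}\, y_n}
                     {\sqrt{\sum_{n \in I_m} x_{n+m}^2 \,\sum_{n \in I_m} y_n^2}}}_{r_{\mathrm{overlap}}(m)}
    \cdot
    \sqrt{\frac{\sum_{n \in I_m} x_{n+m}^2 \,\sum_{n \in I_m} y_n^2}
               {\sum_{n=1}^{N} x_n^2 \,\sum_{n=1}^{N} y_n^2}}
  \;\approx\; r_{\mathrm{overlap}}(m)\,\frac{N-|m|}{N}
```

Read this way, the numerator of the 'coeff' estimate loses terms as |m| grows while its denominator stays fixed, which produces a roughly triangular envelope even for a perfectly periodic signal.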

Now assuming that the output from xcorr is correct and the go-to objective (which is, let's be honest, extremely likely to be the case):

If it is a matter of bias, why does corr(xx,Y,'rows','pairwise'), which accounts for the missing elements in its inputs, not appropriately correct for it?
Is there a simple formula that would correct all of this? Perhaps multiplying by a lag-dependent factor, or, in the general case where the data is gappy, by a factor that depends on the number of x-y pairs used for each correlation?
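One candidate rescaling can be tried numerically. The sketch below is a guess at such a factor, not a confirmed answer: it multiplies an uncentred overlap-only correlation by (N-|m|)/N, the fraction of pairs available at lag m, and compares the result with an xcorr-style 'coeff' value computed by explicit sums. The signals are noise-free versions of the sinusoids above, maxlagSamples is a hypothetical choice, and all variable names are illustrative. For gappy data the count N-|m| would presumably be replaced by the number of valid pairs at each lag, which is not tested here.

```matlab
% Sketch: does scaling by nPairs/N approximate the xcorr-style taper?
Fs = 4;  f = 1/50;
t  = (1800:1/Fs:1999)';
N  = numel(t);
x  = 3*cos(2*pi*f*t);            % noise-free, for a clean comparison
y  = 2*cos(2*pi*f*t);

maxlagSamples = 400;             % hypothetical: about half the record length
lags = -maxlagSamples:maxlagSamples;
cCoeff   = zeros(size(lags));    % xcorr-style 'coeff', explicit sums
rOverlap = zeros(size(lags));    % uncentred correlation on the overlap only

for i = 1:numel(lags)
    m  = lags(i);
    n  = (max(1,1-m):min(N,N-m))';   % indices where both y(n) and x(n+m) exist
    xs = x(n+m);
    ys = y(n);
    cCoeff(i)   = sum(xs.*ys) / sqrt(sum(x.^2)*sum(y.^2));
    rOverlap(i) = sum(xs.*ys) / sqrt(sum(xs.^2)*sum(ys.^2));
end

scaled = rOverlap .* (N - abs(lags)) / N;   % candidate correction factor

figure
plot(lags/Fs, cCoeff, lags/Fs, scaled, '--')
legend('xcorr-style coeff','overlap-only, rescaled')
xlabel('Lag (years)')
```

The match is only approximate, because the energy contained in the overlap is not exactly proportional to its length when the window does not span a whole number of periods.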

Thanks a lot for your help. And sorry, I know that's a lot of reading.

Sam