Largest portion of largest correlation coefficient

Roohollah
Roohollah on 22 Sep 2023
Edited: William Rose on 22 Sep 2023
I have n measurements as follows:
(x1,y1), (x2,y2),...,(xn,yn).
Now I want the largest portion of the data that gives a correlation coefficient greater than a prespecified value.
How can I do that?
  2 Comments
Torsten
Torsten on 22 Sep 2023
Do you want to extract m <= n point pairs (xi,yi), with m as large as possible, such that their correlation coefficient exceeds a specified value? Is this the correct interpretation of your question?
Roohollah
Roohollah on 22 Sep 2023
Yes it is.
But the extracted points must be in the same order as the original observations.
For example, you cannot omit the 5th, 19th and 25th observations and then say it is OK because the correlation is higher than the minimum. You can only omit data from the beginning and the end of the observations.


Accepted Answer

Bruno Luong
Bruno Luong on 22 Sep 2023
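% dummy data
x = rand(1,500);
x = x + 0.01*randn(size(x));
x = sort(x);
y = x.^2+0.01*randn(size(x));
cthreshold = 0.99;
n=length(x);
% all (start,end) index pairs with start < end, one contiguous interval per row
int_se = nchoosek(1:n,2);
% interval lengths, sorted so that the longest intervals are tested first
l=int_se(:,2)-int_se(:,1)+1;
[l,is]=sort(l,'descend');
int_se = int_se(is,:);
for k=1:size(int_se,1)
    subidx = int_se(k,1):int_se(k,2);
    xs = x(subidx);
    ys = y(subidx);
    R = corrcoef(xs,ys);
    rxy = R(1,2);
    % stop at the first (hence longest) interval above the threshold
    if rxy > cthreshold
        break
    end
end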
l = length(xs)
l = 337
plot(xs,ys,'.')
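The exhaustive search above calls corrcoef once per candidate interval. As a possible speed-up, and not part of the answer itself, the correlation of every interval of a given length can be computed from cumulative sums in one vectorized step. The sketch below is a variant under stated assumptions: the seed, the data size, the variable names (xc, yc, best) and the minimum window length of 3 are all illustrative choices. Unlike the loop above, it returns the best-correlated window among the longest qualifying length, and the sum-based formula can lose precision to rounding when a window is nearly constant.

```matlab
% Sketch: longest contiguous window whose correlation exceeds a threshold,
% using cumulative sums instead of one corrcoef call per window.
rng(0);                                   % assumed seed, for repeatability only
x = sort(rand(1,200));                    % illustrative data, same shape as above
y = x.^2 + 0.01*randn(size(x));
cthreshold = 0.99;

n  = numel(x);
xc = x - mean(x);                         % centering leaves correlation unchanged
yc = y - mean(y);                         % and reduces cancellation error
Sx  = [0 cumsum(xc)];     Sy  = [0 cumsum(yc)];
Sxx = [0 cumsum(xc.^2)];  Syy = [0 cumsum(yc.^2)];
Sxy = [0 cumsum(xc.*yc)];

best = [];                                % [start stop] of the selected window
for len = n:-1:3                          % longest first; skip trivial 2-point fits
    s = 1:(n-len+1);                      % every start index for this length
    e = s + len - 1;                      % matching end indices
    sx  = Sx(e+1)  - Sx(s);   sy  = Sy(e+1)  - Sy(s);
    sxx = Sxx(e+1) - Sxx(s);  syy = Syy(e+1) - Syy(s);
    sxy = Sxy(e+1) - Sxy(s);
    % Pearson correlation of each window from its partial sums
    r = (len*sxy - sx.*sy) ./ sqrt((len*sxx - sx.^2) .* (len*syy - sy.^2));
    [rmax, k] = max(r);
    if rmax > cthreshold
        best = [s(k) e(k)];
        break
    end
end
if ~isempty(best)
    fprintf('Window %d..%d (%d points)\n', best(1), best(2), best(2)-best(1)+1);
end
```

If no window of at least 3 points qualifies, best stays empty, so the caller can tell "not found" apart from a degenerate two-point match.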

More Answers (1)

William Rose
William Rose on 22 Sep 2023
I will suggest an approach, but first, why do you want to do this? It sounds suspicious: like you are selecting a subset of the data to get a correlation that is high. Why could this ever be justified?
Compute the correlation over all your data. If the correlation is below the threshold, then find the biggest outlier by checking how much each point deviates from the regression line. Eliminate that point, and recalculate the regression without it. Repeat this process, eliminating the biggest remaining outlier each time, until the correlation reaches the desired level.
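The procedure just described can be sketched as follows. This is only an illustration under assumptions: the test data, the injected outliers, the seed, the target corrGoal = 0.95, and the floor of 3 retained points are all made up for the example. Note that it removes points from anywhere in the series, so it does not honor the requirement, stated in the comments on the question, that only points at the beginning and end may be dropped.

```matlab
% Sketch: repeatedly drop the point farthest from the regression line
% until the correlation reaches a target.
rng(1);                                   % assumed seed, for repeatability only
x = 1:50;
y = x + 5*randn(1,50);                    % illustrative noisy linear data
y([7 30]) = y([7 30]) + 40;               % two injected outliers (illustrative)
corrGoal = 0.95;                          % assumed target correlation
keep = true(size(x));                     % logical mask of retained points

R = corrcoef(x(keep), y(keep));
rho = R(1,2);
while abs(rho) < corrGoal && nnz(keep) > 3
    p = polyfit(x(keep), y(keep), 1);     % line through the retained points
    resid = abs(y - polyval(p, x));       % deviation of every point from it
    resid(~keep) = -Inf;                  % ignore points already removed
    [~, worst] = max(resid);
    keep(worst) = false;                  % drop the biggest remaining outlier
    R = corrcoef(x(keep), y(keep));
    rho = R(1,2);
end
fprintf('Kept %d of %d points, corr = %.3f\n', nnz(keep), numel(x), rho);
```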
  6 Comments
Roohollah
Roohollah on 22 Sep 2023
No man. This is not the case. If such a thing happens in your observations, it means that there is definitely something wrong in your experiment and you have to do it again.
William Rose
William Rose on 22 Sep 2023
Edited: William Rose on 22 Sep 2023
If I understand your response correctly, then the approach you must use is already prescribed: you drop the first or last element of the vector each time until the correlation reaches the desired level.
I do not understand the rationale for this approach, but I am sure there is one. It is not obvious to me that the approach you have described will work very well. But maybe it is good for data with certain typical error properties.
If my understanding is correct, then this seems pretty straightforward. If you are new to Matlab, or new to programming in general, then maybe it is not obvious how to do it.
corrGoal=0.9;
noiseAmpl=10;
N=50; x0=1:N;
y0=x0+noiseAmpl*randn(1,N); % data to analyze
% Next line creates data with noise that is largest at the
% beginning and the end
% y0=x0+noiseAmpl*(2/N)*(-(N-1)/2:(N-1)/2).*randn(1,N);
x=x0; y=y0;
rhoMtx=corrcoef(x,y); % 2x2 matrix of correlation coefficients
rho=rhoMtx(1,2); % correlation between x and y
p=polyfit(x,y,1);
yfit=polyval(p,x);
yresid=y-yfit;
fprintf('Initial correlation (N=%d): %.3f\n',N,rho)
Initial correlation (N=50): 0.870
while abs(rho)<corrGoal
    if abs(yresid(1))>abs(yresid(N)) % compare absolute deviations from the fit
        x=x(2:N); y=y(2:N); % discard initial point
    else
        x=x(1:N-1); y=y(1:N-1); % discard last point
    end
    N=N-1; % decrement N
    p=polyfit(x,y,1);
    yfit=polyval(p,x);
    yresid=y-yfit;
    rhoMtx=corrcoef(x,y);
    rho=rhoMtx(1,2);
end
fprintf('Corr=%.3f, slope=%.2f, intercept=%.2f, N=%d.\n',rho,p(1),p(2),N);
Corr=-1.000, slope=-4.53, intercept=102.94, N=2.
plot(x0,y0,'b+',x,y,'ro',x,yfit,'-r'); % plot results
xlabel('X'); ylabel('Y'); axis equal; grid on
legend('original','final','final regression','Location','southeast')
On each pass, the code above eliminates the first or last point, whichever deviates more from the regression line.
The script runs without error. With noiseAmpl=10, the desired correlation is attained only when two points remain, at which point abs(corr)=1.00, of course. The two-point fit does not reflect the overall relationship between the original vectors x0 and y0. Therefore it is not very interesting or satisfying.
If the original y values had larger noise at the start and end than in the middle, I expect you would get a more pleasing result, i.e. a result in which only a few points at the start and end would be eliminated. I included a commented-out line in the script, which does this. You can un-comment it, and see what happens.
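As a minimal sketch of that experiment, the version below enables the edge-heavy noise line and adds a floor on the number of points so that the loop cannot collapse to a two-point fit. The floor minN, the flag reached, and the seed are hypothetical additions that are not in the script above. The sketch compares absolute residuals, matching the stated intent of dropping whichever end point deviates more from the regression line.

```matlab
% Sketch: end-trimming with edge-heavy noise and a minimum-size guard.
rng(2);                                   % assumed seed, for repeatability only
corrGoal = 0.9; noiseAmpl = 10; N = 50;
minN = 5;                                 % hypothetical floor on retained points
x = 1:N;
% noise amplitude grows linearly toward both ends of the record
y = x + noiseAmpl*(2/N)*(-(N-1)/2:(N-1)/2).*randn(1,N);

R = corrcoef(x,y); rho = R(1,2);
while abs(rho) < corrGoal && N > minN
    resid = y - polyval(polyfit(x,y,1), x);   % residuals from the current fit
    if abs(resid(1)) > abs(resid(end))
        x = x(2:end);   y = y(2:end);         % discard initial point
    else
        x = x(1:end-1); y = y(1:end-1);       % discard last point
    end
    N = N - 1;
    R = corrcoef(x,y); rho = R(1,2);
end
reached = abs(rho) >= corrGoal;           % false if the floor stopped the loop
fprintf('Corr=%.3f, N=%d, goal reached: %d\n', rho, N, reached);
```

Checking reached afterwards distinguishes a genuine result from a run that merely hit the floor.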


Release

R2023a
