finding extreme values in a panel data set

Dear Matlab experts
I have
A={
[1] ' ' [NaN] [NaN] [NaN]
[1] '30/11/08' [11.1447] [0.2117] [463.1360]
[1] '28/12/08' [11.7129] [0.2209] [436.6316]
[1] '25/01/09' [11.5152] [0.2212] [441.5430]
[1] '22/02/09' [11.4854] [0.2201] [453.7015]
[1] '22/03/09' [NaN] [0.2185] [461.3925]
[1] '19/04/09' [10.9700] [0.2104] [486.4095]
[1] '17/05/09' [11.4846] [0.2162] [451.4833]
[1] '14/06/09' [10.9967] [0.2158] [475.8620]
[1] '12/07/09' [11.1990] [0.2211] [449.4574]
[2] ' ' [NaN] [NaN] [NaN]
[2] '30/11/08' [10.8638] [0.2175] [472.2659]
[2] '28/12/08' [11.0897] [0.2182] [479.3408]
[2] '25/01/09' [11.0996] [0.2286] [442.6719]
[2] '22/02/09' [11.0473] [0.2211] [481.4548]
[2] '22/03/09' [11.1151] [0.2259] [468.2757]
[2] '19/04/09' [10.9865] [0.2300] [461.5581]
[2] '17/05/09' [10.7931] [0.2259] [487.6257]
[2] '14/06/09' [10.5845] [0.2200] [529.8777]
[2] '12/07/09' [10.8926] [0.2315] [433.3039]
[3] ' ' [NaN] [NaN] [NaN]
[3] '30/11/08' [10.7658] [0.2339] [487.7729]
[3] '28/12/08' [10.5227] [0.2305] [505.4553]
[3] '25/01/09' [10.6700] [0.2279] [480.0745]
[3] '22/02/09' [10.7632] [NaN] [518.5076]
[3] '22/03/09' [10.5065] [0.2271] [470.0111]
[3] '19/04/09' [10.6112] [0.2215] [483.5394]
[3] '17/05/09' [10.5855] [0.2190] [497.3172]
[3] '14/06/09' [10.3085] [0.2158] [515.7492]
[3] '12/07/09' [10.4475] [0.2211] [482.3380]
}
A is a panel data set that shows the evolution of some variables (columns 3-5) over time (column 2) for 3 individuals (column 1).
I would like
1) to detect outliers (their position and their corresponding values) in columns 3, 4 and 5 where an outlier is defined as a “a value that it is larger than (or equal to) 3 standard deviations over the mean”.
2) plot a diagram for each of the columns 3, 4 and 5 where outliers will stand out with, say, a red color
Could I do that with some matlab code?
Thank you very much
just to mention that my real A is huge where I have 400 individuals.

1 comentario

Image Analyst
Image Analyst el 22 de Jul. de 2012
Editada: Image Analyst el 22 de Jul. de 2012
That's far, far from huge. Do you want this done separately for each group listed in column 1? It looks complicated by the fact that not only do NaNs separate the groups listed in column 1, but there are random Nan's scattered about.

Iniciar sesión para comentar.

 Respuesta aceptada

Image Analyst
Image Analyst el 22 de Jul. de 2012
Editada: Image Analyst el 22 de Jul. de 2012
What have you tried so far? Did you try anything along the lines of
column3 = cell2mat(A(:,3))
goodIndexes = ~isnan(column3)
goodColumn3 = column3(goodIndexes)
mean3 = mean(goodColumn3)
std3 = std(goodColumn3)
lowerLimit = mean3 - std3
upperLimit = mean3 + std3
outliers = column3 < lowerLimit | column3 > upperLimit
outlierIndexes = find(outliers);
normalIndexes = find(~outliers);
x = (1:length(column3))';
stem(x(normalIndexes), column3(normalIndexes), 'b');
hold on;
grid on;
stem(x(outlierIndexes), column3(outlierIndexes), 'r');
% Enlarge figure to full screen.
set(gcf, 'units','normalized','outerposition',[0 0 1 1]);
and so on for the other columns?

6 comentarios

salva
salva el 22 de Jul. de 2012
Hi imager.thank you for your reply. Yes it is almost ok apart from the fact that you code assumes that column 3 refers to one group. I want apply this for each group listed in column 1 (groups 1,2,3) and as I said in my initial question I have 400 such groups.
Thank you
Well you probably already did this but you could adapt it to find the groups, then process each group:
column3 = cell2mat(A(:, 3))
column4 = cell2mat(A(:, 4))
column5 = cell2mat(A(:, 5))
badIndexes3 = isnan(column3)
badIndexes4 = isnan(column4)
badIndexes5 = isnan(column5)
separatorIndexes = badIndexes3 & badIndexes4 & badIndexes5;
numberOfGroups = length(separatorIndexes)
% Tack on one more.
separatorIndexes = find([separatorIndexes; true]);
% Preallocate cells for outliers.
% Cells because there are different numbers of outliers for each group.
outliers = cell(1,1);
outlierIndexes = cell(1,1);
normalIndexes = cell(1,1);
outlierValues3 = cell(1,1);
for group = 1 : numberOfGroups
% Go inbetween the separators.
row1 = separatorIndexes(group)+1;
row2 = separatorIndexes(group+1) - 1
extractedGroup3 = column3(row1:row2);
goodIndexes3 = ~isnan(extractedGroup3)
goodColumn3 = extractedGroup3(goodIndexes3)
mean3(group) = mean(extractedGroup3)
std3(group) = std(extractedGroup3)
lowerLimit(group) = mean3 - std3
upperLimit(group) = mean3 + std3
badLogicalIndexes = extractedGroup3 < lowerLimit | extractedGroup3 > upperLimit;
outliers(group) = {badLogicalIndexes};
outlierIndexes(group) = {find(badLogicalIndexes)};
normalIndexes(group) = {find(~badLogicalIndexes)};
outlierValues3(group) = extractedGroup3(badLogicalIndexes)
end
But you probably already know that mean+3*SD is not a good method for finding outliers for small groups because a single outlier could really affect the true mean A LOT. There are better methods, such as the Minimum Absolute Deviation (MAD, http://en.wikipedia.org/wiki/Median_absolute_deviation)
salva
salva el 22 de Jul. de 2012
thanks imager very much for providing the code. What is the rule for an outlier based on "Minimum Absolute Deviation" ?
cheers
salva
salva el 22 de Jul. de 2012
I searched on the internet but could not find a specific rule based on the MAD
Image Analyst
Image Analyst el 22 de Jul. de 2012
You first calculate the MAD. Then you multiply that by a factor that says how many of those you need to consider something an outlier, just like you chose 3 for sigma to say that if it were more than 3 sigma away from the mean it's an outlier. Same thing except use MAD instead of the standard deviation sigma. Like, if that data point's value is more than 2 (or 3 or 4 or whatever suits your situation) MADs away from the mean, then that data point is an outlier.
salva
salva el 22 de Jul. de 2012
thanks imager

Iniciar sesión para comentar.

Más respuestas (0)

Etiquetas

Preguntada:

el 22 de Jul. de 2012

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by