How can I detect and remove outliers from a large dataset?
I am trying to process a large dataset (n = 5,000,000) and am struggling to write MATLAB code that detects and removes all the outliers in it. I tried the modified Thompson tau method, but it didn't work, and I am now trying to apply the modified z-score method but still can't make headway with the MATLAB code.
Attached is a plot of the signal, with its peaks and dips, for better understanding. I also want to fill the deleted outlier points by interpolation and would appreciate a suggestion.
I would appreciate any further assistance on how to get rid of the peaks and dips in the signal and how to fill the removed outlier points by interpolation.
I would also appreciate suggestions for other methods to remove the outliers and, if possible, code for them.
Thank you.
2 comments
Star Strider
on 12 Mar 2014
Do you have any trends in your data that you could model, perhaps with nlinfit or other regression routines? I have no idea what you are doing or what your data are, but detecting trends and other patterns first could make your task easier.
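One way to act on this suggestion, sketched under assumptions: the signal below is synthetic (a linear trend plus a slow oscillation with two planted spikes), `polyfit` stands in for `nlinfit` so that no toolbox is needed, and the 3.5 cutoff on the modified z-score is a common convention, not something taken from this thread.

```matlab
% Synthetic stand-in for the real data (assumption): trend + oscillation + 2 spikes.
t = (1:2000)';
signal = 0.01*t + sin(2*pi*t/500);
signal(300)  = signal(300)  + 8;   % planted peak
signal(1200) = signal(1200) - 8;   % planted dip

% Model the trend with a straight line and work on the residuals.
p = polyfit(t, signal, 1);
trend = polyval(p, t);
residual = signal - trend;

% Modified z-score of the residuals: 0.6745 * (x - median) / MAD,
% where MAD is the median absolute deviation about the median.
med = median(residual);
madValue = median(abs(residual - med));
modifiedZ = 0.6745 * (residual - med) / madValue;
isOutlier = abs(modifiedZ) > 3.5;

% Show which samples were flagged.
disp(find(isOutlier)');
```

In this synthetic case the ramp alone spreads the raw values far more than the spikes do, so the same test applied to `signal` directly would likely miss them; removing the trend first is what makes them stand out.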
Arinze
on 15 Mar 2014
Answers (4)
Shahab B
on 30 Sep 2016
1 vote
How can I use it for simple data such as: main=[0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
Note that the outlier is 347.666506871168.
4 comments
Image Analyst
on 30 Sep 2016
Try this
clc;
clearvars;
close all;
workspace;
fontSize = 30;
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
%---------------------------------------------------------------
% This won't work. Not enough data to determine percentiles.
percntiles = prctile(vector,[5 95]) %5th and 95th percentile
outlierIndexes = vector < percntiles(1) | vector > percntiles(2);
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
%---------------------------------------------------------------
% This will work.
% Compute the mean. It is used as the center value for the deviations.
meanValue = mean(vector)
% Compute the absolute differences. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute differences
mad = median(absoluteDeviation)
% Find outliers. They're outliers if the absolute difference
% is more than some factor times the mad value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * mad;
outlierIndexes = abs(absoluteDeviation) > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
%---------------------------------------------------------------
% Fancy plots in the following section. Delete if you don't need it.
% Show the original data and the absolute deviation
subplot(2, 1, 1);
bar(vector);
hold on;
line(xlim, [meanValue, meanValue], 'Color', 'r', 'LineWidth', 2);
grid on;
title('Original Data', 'FontSize', fontSize);
message = sprintf('Mean Value = %.2f', meanValue);
text(3, 150, message, 'FontSize', 18, 'Color', 'r');
subplot(2, 1, 2);
bar(absoluteDeviation);
grid on;
title('Absolute Deviations', 'FontSize', fontSize);
% Put a line for the mad.
line(xlim, [mad, mad], 'Color', 'r', 'LineWidth', 2);
message = sprintf('MAD Value = %.2f', mad);
text(3, 50, message, 'FontSize', 18, 'Color', 'r');
% Put a line for the outlier threshold.
line(xlim, [thresholdValue, thresholdValue], 'Color', 'r', 'LineWidth', 2);
message = sprintf('Outlier Threshold Value = %.2f', thresholdValue);
text(3, 200, message, 'FontSize', 18, 'Color', 'r');
% Set up figure properties:
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);
% Get rid of tool bar and pulldown menus that are along top of figure.
set(gcf, 'Toolbar', 'none', 'Menu', 'none');
% Give a name to the title bar.
set(gcf, 'Name', 'Demo by ImageAnalyst', 'NumberTitle', 'Off')

Image Analyst
on 21 Nov 2016
There are several definitions of MAD. My code above uses definition 1.2.1 as listed on this page, https://en.wikipedia.org/wiki/Average_absolute_deviation, which gives four definitions covering all combinations of mean and median. You're welcome to use whichever of those definitions best meets your needs.
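For comparison, here is a minimal sketch of the four mean/median combinations on the five-element vector from this thread. The variable names are made up for illustration; only `mean`, `median`, and `abs` from base MATLAB are used.

```matlab
% The five-element example vector from this thread.
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];

meanCenter   = mean(vector);
medianCenter = median(vector);

% Naming: <statistic of the deviations>About<center value>
meanAboutMean     = mean(abs(vector - meanCenter));
meanAboutMedian   = mean(abs(vector - medianCenter));
medianAboutMean   = median(abs(vector - meanCenter));    % variant used in the code earlier in this thread
medianAboutMedian = median(abs(vector - medianCenter));  % what is usually meant by "MAD"

fprintf('mean about mean     = %.4f\n', meanAboutMean);
fprintf('mean about median   = %.4f\n', meanAboutMedian);
fprintf('median about mean   = %.4f\n', medianAboutMean);
fprintf('median about median = %.4f\n', medianAboutMedian);
```

The two median-of-deviations variants differ a lot on this vector, because the single large value pulls the mean but barely moves the median. Any sensitivity factor therefore has to be chosen together with the definition.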
Nivodi
on 14 Aug 2018
Image Analyst, how can I apply this part of your code to several columns?
% Compute the median absolute difference
meanValue = mean(vector)
% Compute the absolute differences. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute differences
mad = median(absoluteDeviation)
% Find outliers. They're outliers if the absolute difference
% is more than some factor times the mad value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * mad;
outlierIndexes = abs(absoluteDeviation) > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
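A sketch of one way to run the same test on every column at once, assuming the data is a matrix with one variable per column. The `data` matrix is hypothetical, and the subtraction relies on implicit expansion, so it needs R2016b or newer (use `bsxfun` on older releases).

```matlab
% Hypothetical data: 10 rows, 3 columns, one planted outlier per column.
data = [ 10 200   5
         11 201   6
         12 199 500
         10 900   5
         95 200   6
         11 202   7
         12 198   5
         10 201   6
         11 200   7
         12 199   6];
sensitivityFactor = 6;   % same factor as in the quoted code

% Every statistic is taken along dimension 1, giving one value per column.
meanValues = mean(data, 1);
absoluteDeviations = abs(data - meanValues);     % implicit expansion (R2016b+)
madValues = median(absoluteDeviations, 1);
thresholdValues = sensitivityFactor * madValues;

% Logical mask with the same size as data; true marks an outlier.
outlierMask = absoluteDeviations > thresholdValues;

% Columns can lose different numbers of points, so mark outliers as NaN
% instead of deleting them, which keeps the matrix rectangular.
cleaned = data;
cleaned(outlierMask) = NaN;
disp(cleaned);
```

Because this rule is centered on the mean, a single extreme value in a short column can hide itself by shifting the mean, so it needs a reasonable number of rows to work. If your release has `isoutlier` (R2017a or newer), it also works column by column, though its default rule (median-centered, scaled MAD) differs from the quoted code.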
Image Analyst
on 12 Mar 2014
0 votes
That's not large. It's just a fraction of the size of a typical digital image. You can use "deleteoutliers" by Brett Shoelson of MathWorks: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers Or you could try the Median Absolute Deviation (a popular statistical method for detecting outliers) as demonstrated on an image in the file I attached.
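As a rough illustration of the median absolute deviation route on a vector of the size in the question, here is a minimal sketch. The signal is synthetic (a sine wave with 50 planted spikes), and the modified z-score constants 0.6745 and 3.5 are the commonly quoted ones, not values from this thread. Flagged samples are filled by linear interpolation with `interp1`.

```matlab
% Synthetic stand-in for the 5,000,000-sample signal (assumption).
n = 5000000;
t = (1:n)';
signal = sin(2*pi*t/50000);
spikeIdx = (1000:100000:n)';          % hypothetical spike positions
spikeSign = ones(size(spikeIdx));
spikeSign(1:2:end) = -1;              % alternate dips and peaks
signal(spikeIdx) = signal(spikeIdx) + 25*spikeSign;

% Modified z-score based on the median absolute deviation about the median.
med = median(signal);
madValue = median(abs(signal - med));
modifiedZ = 0.6745 * (signal - med) / madValue;
isOutlier = abs(modifiedZ) > 3.5;

% Replace flagged samples by linear interpolation between good neighbors.
good = ~isOutlier;
cleaned = signal;
cleaned(isOutlier) = interp1(t(good), signal(good), t(isOutlier), 'linear', 'extrap');

fprintf('Flagged %d of %d samples.\n', nnz(isOutlier), n);
```

A single global threshold like this only suits a signal without a strong trend. For drifting data, a moving-window rule is probably a better fit, for example `isoutlier` with the `'movmedian'` option or `filloutliers`, both available from R2017a.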
7 comments
Arinze
on 12 Mar 2014
Image Analyst
on 13 Mar 2014
I don't know what modified Thompson tau is, but the other methods I mentioned take a couple of minutes at the most, not all day.
Star Strider
on 13 Mar 2014
I had to look it up, since I’ve never heard of it either. It seems to be a technique used primarily in mechanical engineering. It is iterative, and begins by removing data it defines as ‘outliers’, then recalculates its criteria and continues removing data and recalculating its criteria until no data satisfy its definition of ‘outlier’.
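For anyone else unfamiliar with it, here is a minimal sketch of the iteration as it is usually described. It uses the standard textbook rejection threshold tau = t*(n-1) / (sqrt(n)*sqrt(n-2+t^2)), with t taken from Student's t distribution. The data vector and `alpha` are hypothetical, and `tinv` requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical sample: ten similar readings and one suspicious value.
x = [48.9 49.2 49.2 49.3 49.3 49.8 49.9 50.1 50.2 50.5 60.0]';
alpha = 0.05;                 % significance level (assumption)
keep = true(size(x));         % true = still treated as valid data

while nnz(keep) > 2           % tinv needs at least 1 degree of freedom
    xk = x(keep);
    n = numel(xk);

    % Find the retained point farthest from the current mean.
    delta = abs(x - mean(xk));
    delta(~keep) = -Inf;      % ignore points that were already removed
    [maxDelta, idx] = max(delta);

    % Rejection threshold for the current sample size.
    tCrit = tinv(1 - alpha/2, n - 2);
    tau = tCrit * (n - 1) / (sqrt(n) * sqrt(n - 2 + tCrit^2));

    if maxDelta > tau * std(xk)
        keep(idx) = false;    % reject one point, then recompute everything
    else
        break;                % the most extreme point passes, so stop
    end
end

outliers = x(~keep);
disp(outliers');
```

Each pass recomputes the mean and standard deviation over all retained points and removes at most one point. On millions of samples with many outliers, that repeated work may be part of why the method runs for so long.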
Arinze
on 14 Mar 2014
Image Analyst
on 14 Mar 2014
Plot again with markers. I don't know what's data and what's outliers. To me it looks like data only if it has the exact value of 0 and everything else is an outlier.
Image Analyst
on 15 Mar 2014
So is anything that is not exactly zero an outlier?
Tim leonard
on 12 Mar 2014
Trimming your values based on percentiles is quick and powerful:
vector = randi(100,100,1);
percntiles = prctile(vector,[5 95]); %5th and 95th percentile
outlierIndex = vector < percntiles(1) | vector > percntiles(2);
%remove outlier values
vector(outlierIndex) = [];
1 comment
Image Analyst
on 12 Mar 2014
But something at the 1st, 99th, or 100th percentile is not necessarily an outlier, so you could be getting rid of good data. It's quick but I wouldn't call it powerful. I'd call it risky, unless you know for a fact that you have a certain specific amount of noise present.
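A small experiment makes the risk concrete. This is a sketch on synthetic Gaussian noise with nothing planted in it. `prctile` requires the Statistics and Machine Learning Toolbox, as in the answer above.

```matlab
rng(0);                               % reproducible random numbers
clean = randn(100000, 1);             % Gaussian noise only, no outliers planted
limits = prctile(clean, [5 95]);      % 5th and 95th percentiles
flagged = clean < limits(1) | clean > limits(2);
fractionRemoved = nnz(flagged) / numel(clean);
fprintf('Removed %.1f%% of the samples.\n', 100 * fractionRemoved);
```

By construction, about 10% of the samples fall outside the two limits whether or not any of them are bad, so the trimming is only appropriate when the contamination level is actually known.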
Amir H. Souri
on 26 Jun 2017
0 votes
Hi, I may be late, but I just want to point out that the definition of an outlier is totally subjective. In order to find outliers, you need to estimate the probability distribution of your data, fit a distribution (say, for example, Gaussian), and check whether the fit is statistically significant (you may use the Kolmogorov–Smirnov test or a bootstrap method). Then you will be able to identify the outliers by defining a confidence interval. For example, you can say that any data within the 95% confidence interval are acceptable and the others can be ignored as outliers. As I mentioned, there is no absolute answer, and it totally depends on the nature of the data and how strict you want to be with regard to the confidence interval.
Good luck!
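A minimal sketch of the interval step, assuming a Gaussian model has already been judged adequate. The goodness-of-fit check itself (for example `kstest` from the Statistics and Machine Learning Toolbox) is left out. The data are synthetic, and 1.96 is the two-sided 95% point of the standard normal distribution.

```matlab
rng(1);                                     % reproducible random numbers
data = [10 + 2*randn(1000, 1); 30; -15];    % hypothetical sample, two planted extremes

% Fit the Gaussian by its sample mean and standard deviation.
mu = mean(data);
sigma = std(data);

% Central 95% interval of the fitted distribution.
z = 1.96;
lowerLimit = mu - z*sigma;
upperLimit = mu + z*sigma;

isOutlier = data < lowerLimit | data > upperLimit;
fprintf('Interval [%.2f, %.2f] flags %d of %d points.\n', ...
    lowerLimit, upperLimit, nnz(isOutlier), numel(data));
```

A 95% interval flags roughly 5% of perfectly ordinary Gaussian data along with the planted extremes, which illustrates how the strictness of the interval is a choice and not something the data dictate.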