How can I detect and remove outliers from a large dataset?

I am trying to process a large dataset (n = 5,000,000) and am struggling to write code that detects and removes all the outliers in it. I tried the modified Thompson tau method, but it didn't work. I am now trying the modified z-score method, but I still can't make any headway with the MATLAB code.
Attached is a plot of the signal, showing the peaks and dips, for better understanding. I would also like to fill the removed outlier points by interpolation and would appreciate a suggestion on how to do that.
Any help with getting rid of the peaks and dips in the signal would be appreciated.
I would also welcome suggestions for other outlier-removal methods and, if possible, code for them.
Thank you.

2 comments

Do you have any trends in your data that you could model, perhaps with nlinfit or other regression routines? I have no idea what you are doing or what your data are, but detecting trends and other patterns first could make your task easier.
Star Strider, I attached a picture of the plot for better understanding.


Answers (4)

Shahab B on 30 Sep 2016
How can I use it for simple data such as: main=[0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
Note that the outlier is 347.666506871168.

4 comments

Try this
clc;
clearvars;
close all;
workspace;
fontSize = 30;
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
%---------------------------------------------------------------
% This won't work. Not enough data to determine percentiles.
percntiles = prctile(vector,[5 95]) %5th and 95th percentile
outlierIndexes = vector < percntiles(1) | vector > percntiles(2);
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
%---------------------------------------------------------------
% This will work.
% Compute the median absolute deviation about the mean.
meanValue = mean(vector)
% Compute the absolute deviations from the mean. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute deviations.
% Note: this variable name shadows the Statistics Toolbox function mad().
mad = median(absoluteDeviation)
% Find outliers. They're outliers if the absolute deviation
% is more than some factor times the mad value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * mad;
% absoluteDeviation is already non-negative, so no further abs() is needed.
outlierIndexes = absoluteDeviation > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
%---------------------------------------------------------------
% Fancy plots in the following section. Delete if you don't need it.
% Show the original data and the absolute deviation
subplot(2, 1, 1);
bar(vector);
hold on;
line(xlim, [meanValue, meanValue], 'Color', 'r', 'LineWidth', 2);
grid on;
title('Original Data', 'FontSize', fontSize);
message = sprintf('Mean Value = %.2f', meanValue);
text(3, 150, message, 'FontSize', 18, 'Color', 'r');
subplot(2, 1, 2);
bar(absoluteDeviation);
grid on;
title('Absolute Deviations', 'FontSize', fontSize);
% Put a line for the mad.
line(xlim, [mad, mad], 'Color', 'r', 'LineWidth', 2);
message = sprintf('MAD Value = %.2f', mad);
text(3, 50, message, 'FontSize', 18, 'Color', 'r');
% Put a line for the outlier threshold.
line(xlim, [thresholdValue, thresholdValue], 'Color', 'r', 'LineWidth', 2);
message = sprintf('Outlier Threshold Value = %.2f', thresholdValue);
text(3, 200, message, 'FontSize', 18, 'Color', 'r');
% Set up figure properties:
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);
% Get rid of tool bar and pulldown menus that are along top of figure.
set(gcf, 'Toolbar', 'none', 'Menu', 'none');
% Give a name to the title bar.
set(gcf, 'Name', 'Demo by ImageAnalyst', 'NumberTitle', 'Off')
Thanks, Image Analyst, for the detailed answer. As far as I know, though, MAD is the deviation around the median, not the mean. [ref]
There are several definitions of MAD. My code above does definition 1.2.1 as listed on this page https://en.wikipedia.org/wiki/Average_absolute_deviation which gives 4 definitions using all combinations of mean and median. You're welcome to use whichever of those definitions best meets your needs.
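To make the difference between two of those definitions concrete, here is a minimal sketch that runs the mean-centred and the median-centred variants on the five-point vector from the question above. The variable names are illustrative, and the sensitivity factor of 6 is simply carried over from the code above, not a recommended value.

```matlab
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
sensitivityFactor = 6;  % same factor as in the code above; tune for your data

% Variant A: median of the absolute deviations about the MEAN.
devFromMean = abs(vector - mean(vector));
madAboutMean = median(devFromMean);
outliersMeanCentred = vector(devFromMean > sensitivityFactor * madAboutMean)

% Variant B: median of the absolute deviations about the MEDIAN.
devFromMedian = abs(vector - median(vector));
madAboutMedian = median(devFromMedian);
outliersMedianCentred = vector(devFromMedian > sensitivityFactor * madAboutMedian)
```

On this vector, variant A flags only 347.666506871168, while variant B flags both 0 and 347.666506871168, because the three values near 97 make the median-centred spread very small. Which behaviour is right depends on whether the 0 is a genuine reading.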
Image Analyst, how can I apply this part of your code to several columns?
% Compute the median absolute difference
meanValue = mean(vector)
% Compute the absolute differences. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute differences
mad = median(absoluteDeviation)
% Find outliers. They're outliers if the absolute difference
% is more than some factor times the mad value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * mad;
outlierIndexes = abs(absoluteDeviation) > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
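One possible way to run the same test on several columns at once is to compute every statistic along dimension 1 and compare column by column. This is a sketch, not the original author's code: the matrix `data` and its second column are made up for illustration, and `bsxfun` is used so that it also runs on releases older than R2016b.

```matlab
% Hypothetical data: one signal per column.
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
data = [vector(:), 2*vector(:) + 10];   % second column is a made-up rescaled copy

sensitivityFactor = 6;                  % same factor as above; tune for your data

colMean = mean(data, 1);                          % 1-by-nCols row of column means
absDev  = abs(bsxfun(@minus, data, colMean));     % deviations, column by column
colMad  = median(absDev, 1);                      % one MAD value per column
thresholds = sensitivityFactor * colMad;          % one threshold per column

% Logical mask with the same size as data: true where a value is an outlier.
outlierMask = bsxfun(@gt, absDev, thresholds);

% Columns can lose different numbers of points, so mark outliers as NaN
% instead of deleting them; that keeps the matrix rectangular.
cleaned = data;
cleaned(outlierMask) = NaN;
```

Here the second row is flagged in both columns. To pull out the values for a single column `k`, use `data(outlierMask(:, k), k)`.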


Image Analyst on 12 Mar 2014
That's not large. It's just a fraction of the size of a typical digital image. You can use "deleteoutliers" by Brett Shoelson of MathWorks (http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers), or you could try the Median Absolute Deviation (a popular statistical method for detecting outliers), as demonstrated on an image in the file I attached.
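Since the attached demo works on an image, here is a rough sketch of the same idea on a 1-D signal, including the interpolation fill the question asks for. It is not the attached file: the test signal, the spike positions and the factor of 6 are assumptions made for illustration, and only base MATLAB functions (`median`, `interp1`) are used.

```matlab
% Hypothetical test signal: a sine wave with three injected spikes/dips.
t = (1:1000)';
signal = sin(2*pi*t/200);
spikeIdx = [100; 400; 750];
signal(spikeIdx) = signal(spikeIdx) + [8; -6; 10];

% Median absolute deviation about the median.
absDev = abs(signal - median(signal));
madValue = median(absDev);
sensitivityFactor = 6;                   % assumption: tune for your data
isOutlier = absDev > sensitivityFactor * madValue;

% Replace the flagged samples by linear interpolation between good neighbours.
good = ~isOutlier;
cleaned = signal;
cleaned(isOutlier) = interp1(t(good), signal(good), t(isOutlier), 'linear', 'extrap');
```

Everything here is vectorised, so it should scale to millions of samples without trouble. If more than half of the samples share one value (for example, a signal that is mostly exactly zero), the MAD is 0 and every other sample gets flagged, so check `madValue` before trusting the mask. On R2017a or later, the built-in `isoutlier` and `filloutliers` functions cover similar ground.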

7 comments

Thank you, Image Analyst, but I tried deleteoutliers for the modified Thompson tau method and MATLAB ran all day without producing a result. I will try your suggestions immediately. Thanks.
I don't know what modified Thompson tau is, but the other methods I mentioned take a couple of minutes at the most, not all day.
I had to look it up, since I’ve never heard of it either. It seems to be a technique used primarily in mechanical engineering. It is iterative, and begins by removing data it defines as ‘outliers’, then recalculates its criteria and continues removing data and recalculating its criteria until no data satisfy its definition of ‘outlier’.
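For reference, the iterative scheme described above can be sketched roughly as follows. This is an illustration of the textbook modified Thompson tau test, not the code inside deleteoutliers. The sample values and `alpha = 0.05` are assumptions, and `tinv` requires the Statistics Toolbox.

```matlab
% Hypothetical small sample with one obvious outlier.
x = [48.9 49.2 49.2 49.3 49.3 49.8 49.9 50.1 50.2 50.5 60.0];
alpha = 0.05;           % assumption: 5% significance level
removed = [];           % values rejected so far

while numel(x) > 2
    n = numel(x);
    tCrit = tinv(1 - alpha/2, n - 2);                        % Student t critical value
    tau = tCrit*(n - 1) / (sqrt(n)*sqrt(n - 2 + tCrit^2));   % modified Thompson tau
    [delta, idx] = max(abs(x - mean(x)));                    % most extreme point
    if delta <= tau*std(x)
        break;                                               % nothing left to reject
    end
    removed(end+1) = x(idx);                                 % reject it ...
    x(idx) = [];                                             % ... and recompute everything
end
```

Each pass rejects at most one point and then recomputes the mean and standard deviation of everything that remains, so the cost grows with the number of outliers times the length of the data. That may be part of why it ran so long on 5,000,000 samples.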
Attached is a plot of the signal, showing the peaks and dips, for better understanding. Image Analyst, the 'deleteoutliers' command returns an error message in my MATLAB R2013b. What could be the cause, and how do I fix it?
I would appreciate any further help on how to get rid of the peaks and dips in the signal and how to fill the removed outlier points by interpolation.
Looking forward to your replies.
Plot again with markers. I don't know what's data and what's outliers. To me it looks like data only if it has the exact value of 0 and everything else is an outlier.
Arinze on 15 Mar 2014
Edited: Arinze on 15 Mar 2014
Here is a new plot of the signal. My MATLAB doesn't recognise the 'deleteoutliers' command. Any idea why?
So is anything that is not exactly zero an outlier?


Trimming your values based on percentiles is quick and powerful:
vector = randi(100,100,1);
percntiles = prctile(vector,[5 95]); %5th and 95th percentile
outlierIndex = vector < percntiles(1) | vector > percntiles(2);
%remove outlier values
vector(outlierIndex) = [];

1 comment

But something at the 1st, 99th, or 100th percentile is not necessarily an outlier, so you could be getting rid of good data. It's quick, but I wouldn't call it powerful. I'd call it risky, unless you know for a fact that a specific fraction of your data is noise.
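A small sketch of that risk: the percentile trim from the answer, applied to a made-up ramp that contains no outliers at all. As in the answer, `prctile` requires the Statistics Toolbox.

```matlab
% Hypothetical clean data: a plain ramp with nothing unusual in it.
vector = (1:100)';

percntiles = prctile(vector, [5 95]);   % 5th and 95th percentiles
outlierIndex = vector < percntiles(1) | vector > percntiles(2);

% Count how many perfectly good points would be thrown away.
numRemoved = nnz(outlierIndex)
```

Roughly a tenth of the points are discarded even though none of them is an outlier, because the cut-offs are defined by rank rather than by how far a value sits from the rest.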


Amir H. Souri on 26 Jun 2017
Hi, I may be late, but I just want to point out that the definition of an outlier is subjective. To find outliers, you need to estimate the probability distribution of your data: fit a distribution (a Gaussian, for example) and check whether the fit is statistically acceptable (you may use the Kolmogorov–Smirnov test or a bootstrap method). You can then identify the outliers by defining a confidence interval. For example, you can say that any data within the 95% confidence interval are acceptable and everything else can be ignored as outliers. As I mentioned, there is no absolute answer; it depends on the nature of the data and on how strict you want to be about the confidence interval.
Good luck!
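As a rough illustration of the interval step only, here is a sketch that assumes the data really are Gaussian and skips the goodness-of-fit check. The data, the injected values and the variable names are made up. The factor 1.96 is the two-sided 95% point of the standard normal distribution.

```matlab
rng(0);                               % fixed seed so the sketch is repeatable
x = 10 + 2*randn(10000, 1);           % hypothetical Gaussian data
x([10 20 30]) = [40; -25; 55];        % hypothetical injected outliers

% Fit a Gaussian by its sample mean and standard deviation.
mu = mean(x);
sigma = std(x);

% Flag everything outside the central 95% interval of the fitted Gaussian.
z = 1.96;
isOutlier = x < (mu - z*sigma) | x > (mu + z*sigma);
fractionFlagged = mean(isOutlier)
```

With a 95% interval, about 5% of perfectly ordinary Gaussian samples are flagged along with the injected values, which is the strictness trade-off mentioned above. The mean and standard deviation are themselves pulled by the outliers, so a wider interval or a robust estimate may suit heavily contaminated data better.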

Asked: 12 Mar 2014

Commented: 14 Aug 2018
