How can I detect and remove outliers from a large dataset?

Question

0 votes

plot.jpg

I am trying to process a large dataset (n = 5,000,000) and am struggling to write code that detects and removes all the outliers in it. I tried the modified Thompson tau method, but it didn't work. I am now trying the modified z-score method, but I still can't make any headway with the MATLAB code.

Attached is a plot of the signal, showing the peaks and dips, for better understanding.

I would appreciate any help on how to get rid of the peaks and dips in the signal and how to fill the removed outlier points by interpolation.

I would also appreciate suggestions for other outlier-removal methods and, if possible, code for them.

Thank you.

2 comments

Star Strider on 12 Mar 2014

Do you have any trends in your data that you could model, perhaps with nlinfit or other regression routines? I have no idea what you are doing or what your data are, but detecting trends and other patterns first could make your task easier.

Arinze on 15 Mar 2014

Star Strider, I attached a picture of the plot for better understanding.

Answer 1

Shahab B on 30 Sep 2016

1 vote

How can I use it for simple data such as: main=[0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];

Note that the outlier is 347.666506871168.

4 comments

Image Analyst on 30 Sep 2016

Try this:

clc;
clearvars;
close all;
workspace;
fontSize = 30;
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
%---------------------------------------------------------------
% This won't work.  Not enough data to determine percentiles.
percntiles = prctile(vector,[5 95]) %5th and 95th percentile
outlierIndexes = vector < percntiles(1) | vector > percntiles(2);
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
%---------------------------------------------------------------
% This will work.
% Compute the median absolute deviation (MAD) around the mean.
% First compute the mean, which is used as the central point.
meanValue = mean(vector)
% Compute the absolute deviations from the mean.  It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute deviations.
% Named madValue so it does not shadow the Statistics Toolbox function mad().
madValue = median(absoluteDeviation)
% Find outliers.  They're outliers if the absolute deviation
% is more than some factor times the MAD value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * madValue;
outlierIndexes = absoluteDeviation > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
%---------------------------------------------------------------
% Fancy plots in the following section.  Delete if you don't need it.
% Show the original data and the absolute deviation
subplot(2, 1, 1);
bar(vector);
hold on;
line(xlim, [meanValue, meanValue], 'Color', 'r', 'LineWidth', 2);
grid on;
title('Original Data', 'FontSize', fontSize);
message = sprintf('Mean Value = %.2f', meanValue);
text(3, 150, message, 'FontSize', 18, 'Color', 'r');
subplot(2, 1, 2);
bar(absoluteDeviation);
grid on;
title('Absolute Deviations', 'FontSize', fontSize);
% Put a line for the MAD.
line(xlim, [madValue, madValue], 'Color', 'r', 'LineWidth', 2);
message = sprintf('MAD Value = %.2f', madValue);
text(3, 50, message, 'FontSize', 18, 'Color', 'r');
% Put a line for the outlier threshold.
line(xlim, [thresholdValue, thresholdValue], 'Color', 'r', 'LineWidth', 2);
message = sprintf('Outlier Threshold Value = %.2f', thresholdValue);
text(3, 200, message, 'FontSize', 18, 'Color', 'r');
% Set up figure properties:
% Enlarge figure to full screen.
set(gcf, 'Units', 'Normalized', 'OuterPosition', [0 0 1 1]);
% Get rid of tool bar and pulldown menus that are along top of figure.
set(gcf, 'Toolbar', 'none', 'Menu', 'none');
% Give a name to the title bar.
set(gcf, 'Name', 'Demo by ImageAnalyst', 'NumberTitle', 'Off')

Image Analyst on 21 Nov 2016

There are several definitions of MAD. My code above takes the median of the absolute deviations from the mean, which is definition 1.2.1 as listed on this page https://en.wikipedia.org/wiki/Average_absolute_deviation which gives four definitions using all combinations of mean and median. You're welcome to use whichever of those definitions best meets your needs.
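To make the difference between those variants concrete, here is a minimal sketch that computes all four for the five-element vector from this thread. The variable names are my own and the factor of 6 is just the illustrative value used earlier, so treat it as an illustration rather than a recommendation.

```matlab
% Sketch: the four "average absolute deviation" variants for one vector.
vector = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];

devFromMean   = abs(vector - mean(vector));    % deviations around the mean
devFromMedian = abs(vector - median(vector));  % deviations around the median

meanAroundMean     = mean(devFromMean);
meanAroundMedian   = mean(devFromMedian);
medianAroundMean   = median(devFromMean);      % median of deviations from the mean
medianAroundMedian = median(devFromMedian);    % the usual robust "MAD"

% Compare which points each median-based variant flags.
sensitivityFactor = 6;  % illustrative value only
outliersAroundMean   = vector(devFromMean   > sensitivityFactor * medianAroundMean);
outliersAroundMedian = vector(devFromMedian > sensitivityFactor * medianAroundMedian);
```

With this particular vector, the median-around-mean variant flags only 347.67, while the median-around-median variant also flags the leading 0, because the three values near 97 give a much tighter spread. Which behaviour you want depends on whether that 0 is real data.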

Nivodi on 14 Aug 2018

Image Analyst, how can I apply this part of your code to several columns?

% Compute the median absolute deviation (MAD) around the mean.
meanValue = mean(vector)
% Compute the absolute deviations from the mean.  It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute deviations.
madValue = median(absoluteDeviation)
% Find outliers.  They're outliers if the absolute deviation
% is more than some factor times the MAD value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * madValue;
outlierIndexes = absoluteDeviation > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
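
One possible way to run the snippet above on several columns at once, offered as a sketch only. The matrix `data` and its values are made up for illustration, and marking outliers as NaN instead of deleting them is my own choice, made because each column may lose a different number of rows.

```matlab
% Sketch: column-wise version of the snippet above.
% "data" is a hypothetical matrix; each column is treated as one variable.
data = [ 10    5.0;
         11    5.5;
         12    4.5;
         500   5.0;
         13    5.2;
         9    -90;
         10    4.8;
         11    5.0];

sensitivityFactor = 6;                       % whatever you want
meanValues = mean(data, 1);                  % one mean per column
% bsxfun keeps this working on releases before R2016b;
% on newer releases "data - meanValues" does the same thing.
absoluteDeviation = abs(bsxfun(@minus, data, meanValues));
madValues = median(absoluteDeviation, 1);    % one MAD per column
thresholdValues = sensitivityFactor * madValues;
outlierIndexes = bsxfun(@gt, absoluteDeviation, thresholdValues);

% Mark outliers as NaN so the matrix stays rectangular.
cleaned = data;
cleaned(outlierIndexes) = NaN;
```

Be aware that with very few rows per column the mean is pulled so far toward a single extreme value that the point may not exceed a factor of 6, so a median-centred deviation or a smaller factor may be needed for short columns.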

Answer 2

Image Analyst on 12 Mar 2014

0 votes

median_absolute_deviation.m

That's not large; it's just a fraction of the size of a typical digital image. You can use "deleteoutliers" by Brett Shoelson of MathWorks: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers Or you could try the median absolute deviation (a popular statistical method for detecting outliers), as demonstrated on an image in the file I attached.
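
For a 1-D signal like the one in the question, a median-absolute-deviation approach might look like the sketch below. It uses the modified z-score mentioned in the question and fills the flagged samples by linear interpolation with `interp1`. The synthetic signal, the spike positions, and the 3.5 cutoff (a commonly quoted value) are assumptions for illustration, not taken from the attached file or from the asker's data.

```matlab
% Sketch: modified z-score outlier detection plus linear interpolation.
% The signal below is synthetic; replace it with your own data vector.
n = 1000;
t = (1:n)';
signal = sin(2*pi*t/200);          % hypothetical smooth signal
spikeLocations = [100 400 750];    % hypothetical peak/dip positions
signal(spikeLocations) = signal(spikeLocations) + [25; -30; 40];

% Modified z-score: 0.6745 * (x - median) / MAD, where MAD is the
% median absolute deviation around the median.
% Caution: if more than half the samples are identical, MAD is zero
% and this score is undefined.
medianValue = median(signal);
madValue = median(abs(signal - medianValue));
modifiedZ = 0.6745 * (signal - medianValue) / madValue;

% 3.5 is a commonly quoted cutoff; tune it for your data.
outlierIndexes = abs(modifiedZ) > 3.5;

% Replace the outliers by linear interpolation over the good samples.
% 'extrap' covers outliers at the very start or end of the signal.
cleaned = signal;
cleaned(outlierIndexes) = interp1(t(~outlierIndexes), signal(~outlierIndexes), ...
                                  t(outlierIndexes), 'linear', 'extrap');
```

Everything here is vectorized, so a few million samples should not be a problem. If the signal has a trend, computing the median and MAD over a sliding window rather than globally is probably more appropriate. On R2017a or later, the built-in `isoutlier` and `filloutliers` functions cover similar ground.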

7 comments

Arinze on 15 Mar 2014

Edited: Arinze on 15 Mar 2014

plot1.jpg

New plot of the signal. My MATLAB doesn't recognise the 'deleteoutliers' command. Any idea why?

Image Analyst on 15 Mar 2014

So is anything that is not exactly zero an outlier?

Answer 3

Tim leonard on 12 Mar 2014

0 votes

Trimming your values based on percentiles is quick and powerful:

    vector = randi(100,100,1);
    percntiles = prctile(vector,[5 95]); %5th and 95th percentile
    outlierIndex = vector < percntiles(1) | vector > percntiles(2);
    %remove outlier values
    vector(outlierIndex) = [];

1 comment

Image Analyst on 12 Mar 2014

But something at the 1st, 99th, or 100th percentile is not necessarily an outlier, so you could be getting rid of good data. It's quick, but I wouldn't call it powerful. I'd call it risky, unless you know for a fact that you have a certain specific amount of noise present.
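
A quick way to see that risk is to run the same percentile trim on data that contains no outliers at all. The sketch below uses synthetic Gaussian data (an assumption for illustration) and, like the answer above, relies on `prctile` from the Statistics Toolbox.

```matlab
% Sketch: percentile trimming removes data even when there are no outliers.
rng(0);                                  % reproducible synthetic data
cleanData = 50 + 2 * randn(10000, 1);    % hypothetical signal with no outliers

percntiles = prctile(cleanData, [5 95]); % 5th and 95th percentile
outlierIndex = cleanData < percntiles(1) | cleanData > percntiles(2);

% Fraction of samples the trim would throw away.
fractionRemoved = mean(outlierIndex);
fprintf('Removed %.1f%% of the samples.\n', 100 * fractionRemoved);
```

About 10% of the samples are discarded by construction, regardless of whether any of them are actually bad.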

Answer 4

Amir H. Souri on 26 Jun 2017

0 votes

Hi, I may be late, but I just want to point out that the definition of an outlier is totally subjective. In order to find outliers, you need to estimate the probability distribution of your data: fit a distribution (for example, a Gaussian) and check whether the fit is statistically acceptable (you may use the Kolmogorov–Smirnov test or a bootstrap method). Then you will be able to identify the outliers by defining a confidence interval. For example, you can say that any data within the 95% confidence interval are acceptable and the rest can be ignored as outliers. As I mentioned, there is no absolute answer; it depends entirely on the nature of the data and on how strict you want to be with the confidence interval.
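
As a rough sketch of that idea under a Gaussian assumption: estimate a centre and spread, then flag whatever falls outside the 95% interval. The data here are synthetic. Using the median and 1.4826 times the MAD as robust estimates of the mean and standard deviation is my own substitution for a plain Gaussian fit, so that the outliers do not inflate the interval. The `kstest` call is optional, needs the Statistics Toolbox, and is only approximate when the parameters are estimated from the same data.

```matlab
% Sketch: flag points outside a 95% interval of a fitted Gaussian.
rng(1);
data = 10 + 3 * randn(5000, 1);     % hypothetical, roughly Gaussian data
data([10 20 30]) = [60; -45; 80];   % hypothetical gross outliers

% Robust location/scale estimates so the outliers do not inflate the fit.
% For Gaussian data, 1.4826 * MAD estimates the standard deviation.
mu = median(data);
sigma = 1.4826 * median(abs(data - mu));

% Optional normality check, skipped if kstest is not available.
if exist('kstest', 'file')
    % kstest compares against a standard normal, so standardize first.
    rejectNormality = kstest((data - mu) / sigma);
end

z = 1.96;                           % two-sided 95% interval for a Gaussian
lowerLimit = mu - z * sigma;
upperLimit = mu + z * sigma;
outlierIndexes = data < lowerLimit | data > upperLimit;

fractionFlagged = mean(outlierIndexes);
```

Note that a 95% interval flags roughly 5% of perfectly ordinary Gaussian samples by construction, so a wider interval (for example z = 3) may be a better starting point if that is too strict.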

Good luck!

0 comments
