Missing Data: Types & Techniques

What Is Missing Data?

Missing data refers to the absence of expected values in a data set, which can occur due to sensor failures, data corruption, human error, or other factors. For engineers, missing data can lead to inaccurate models, unreliable simulations, and incorrect conclusions, which in turn can degrade system performance and decision-making. Addressing missing data is necessary to maintain data integrity and obtain accurate results.

Figure: Line graph of data with missing values. The x-axis runs from 0 to 9 and the y-axis from 1 to 10; gaps at the third and sixth data points mark where values are unavailable.

Types of Missing Data

Missing data can be classified into three categories; identifying the right category can help you select an appropriate fill method:

Missing at random (MAR): The probability that a value is missing depends on other observed variables in the data set, not on the missing value itself. For instance, a rooftop solar installation relaying irradiance level, grid voltage, frequency, and other telemetry data would have missing values at night or on rainy days because there isn’t enough solar irradiance to power the system; the gaps in grid voltage or frequency are explained by the recorded low irradiance levels.
Missing completely at random (MCAR): The underlying cause of missing values is completely unrelated to any variable in the data set, observed or unobserved. For example, missing packets in weather telemetry could result from a randomly malfunctioning sensor or high channel noise.
Missing not at random (MNAR): The underlying cause of missing data is related to the value of the variable itself. For example, a temperature sensor that has reached its measurement limit cannot report the true reading, so values beyond the limit are lost or recorded as the saturation threshold.
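
The three mechanisms can be made concrete by creating each kind of gap on purpose. The following is a minimal sketch, not a model of a real system: the variable names, the 50 W/m^2 cutoff, the 30-degree sensor limit, and the 30% drop rate are all assumptions chosen for illustration.

```matlab
% Hypothetical telemetry: six samples (all values are made up)
irradiance  = [0 20 300 650 800 10];      % W/m^2
voltage     = [230 231 229 230 232 231];  % V
temperature = [18 22 27 31 35 38];        % degrees C

% MAR: voltage goes missing whenever another variable (irradiance) is low
voltageMAR = voltage;
voltageMAR(irradiance < 50) = NaN;

% MCAR: entries are dropped by a mechanism unrelated to any variable
rng(1);                                   % fixed seed so the mask is repeatable
voltageMCAR = voltage;
voltageMCAR(rand(size(voltage)) < 0.3) = NaN;

% MNAR: temperature goes missing because of its own value (above the limit)
temperatureMNAR = temperature;
temperatureMNAR(temperature > 30) = NaN;

disp(isnan(voltageMAR))
disp(isnan(temperatureMNAR))
```

In the MAR case the gaps can be explained from the irradiance column, while in the MNAR case the only explanation is the unrecorded temperature itself, which is what makes that category harder to fill.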

Figure: Filling missing data entries using interpolation in MATLAB. Day is on the x-axis and weight in kg on the y-axis; circles mark the missing weight values filled by interpolation.

Identifying missing data might seem like a simple task, but choosing the right replacement values is the harder part. Engineers can start by spotting gaps with visualization techniques or by flagging invalid entries. Filling in missing data isn’t just about plugging in numbers; the fill method should suit the type of missingness so that the models, simulations, and decisions built on the data remain accurate and reliable.


Best Practices for Handling Missing Data

Identify Missing Data

Start by systematically detecting missing data in your data set. Engineers can use visualization tools, statistical tests, and anomaly detection techniques to uncover patterns and assess the extent of missingness. Understanding where and why data is missing is the first step in choosing the right approach to address it.
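
A first pass at detection might look like the sketch below, which uses the MATLAB functions standardizeMissing and ismissing. The sample matrix and the -999 "invalid reading" code are assumptions made for illustration.

```matlab
% Sample data; -999 is a hypothetical sentinel for an invalid reading
data = [2, NaN, 5, 7; -999, 3, 4, NaN; 6, 8, NaN, 10];

% Convert the sentinel code to the standard missing indicator (NaN)
data = standardizeMissing(data, -999);

% Logical mask of missing entries, then counts and fractions per column
missingMask     = ismissing(data);
missingPerCol   = sum(missingMask, 1);
fractionMissing = mean(missingMask, 1);

disp(missingPerCol)
disp(fractionMissing)
```

Converting sentinel codes first keeps later steps from treating an invalid reading as a real measurement.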

Document Your Approach

Keep clear records of how missing data is handled in your analysis, including the type of missing data, the techniques used to fill gaps, and any assumptions made. Transparent documentation ensures your work can be reproduced and validated by others, strengthening the reliability of your results.
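
One lightweight way to record what was changed is to keep the second output of fillmissing, a logical mask of the filled entries. The sketch below assumes a simple struct as the log; the field names are illustrative, not a MATLAB convention.

```matlab
A = [1, 2, NaN, 4, NaN, 6, 7, NaN, 9, 10];

% The second output marks which entries were filled
[filled, wasFilled] = fillmissing(A, 'linear');

% Keep a small record alongside the results (field names are illustrative)
fillLog = struct( ...
    'method',        'linear', ...
    'filledIndices', find(wasFilled), ...
    'numFilled',     nnz(wasFilled));

disp(fillLog)
```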

Run Sensitivity Analyses

Test how different methods for handling missing data impact your results. By comparing outcomes across various imputation or deletion techniques, engineers can assess the stability of their conclusions and choose the most reliable approach for their specific data set. This approach helps prevent unintended distortions in models, simulations, and computations.
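
A simple sensitivity check is to hide values you actually know, fill them with several methods, and compare the errors. The sketch below assumes a synthetic quadratic signal and two arbitrarily chosen hidden positions; real data would need gaps placed to resemble the actual missingness pattern.

```matlab
% Known signal (quadratic), with two entries hidden to act as test gaps
truth  = (0:9).^2;
hidden = [3 6];
masked = truth;
masked(hidden) = NaN;

% Candidate fill methods to compare
methods = {'linear', 'previous', 'next'};
rmse = zeros(size(methods));

for k = 1:numel(methods)
    filled  = fillmissing(masked, methods{k});
    err     = filled(hidden) - truth(hidden);
    rmse(k) = sqrt(mean(err.^2));
end

disp(table(methods', rmse', 'VariableNames', {'Method', 'RMSE'}))
```

For this smooth signal, linear interpolation gives the smallest error of the three, but the ranking depends on the data, which is why the comparison is worth running.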

Minimize Bias

Improper handling of missing data can introduce bias, leading to inaccurate conclusions and flawed engineering decisions. To reduce this risk, consider model-based methods, such as machine learning or statistical imputation, that use relationships among variables to fill in missing values. Validating these techniques against benchmarks and performing exploratory data analysis helps keep any introduced bias small and controlled.
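
As a small illustration of how a fill method can distort a data set, the sketch below fills gaps with the overall mean and compares the spread before and after. The data and the choice to hide the two extreme readings are assumptions made to keep the effect easy to see.

```matlab
truth = 1:10;

% Hide the two extreme readings (an illustrative, non-random pattern)
observed = truth;
observed([1 10]) = NaN;

% Replace every gap with the mean of the observed values
meanFilled = fillmissing(observed, 'constant', mean(observed, 'omitnan'));

% Compare spread before and after imputation
stdTruth  = std(truth);
stdFilled = std(meanFilled);
fprintf('std of true data:    %.3f\n', stdTruth);
fprintf('std after mean fill: %.3f\n', stdFilled);
```

The mean is preserved here, but the standard deviation shrinks, so any analysis that relies on variability would be biased by this fill.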

Strategies for Handling Missing Data with MATLAB

Simple Imputation Methods

MATLAB offers straightforward functions for basic imputation, such as replacing missing values with the mean, median, or mode of available data. One approach is to use the fillmissing function with a moving mean or moving median, which replaces each missing value with a statistic of the neighboring values inside a sliding window. For example:

% Sample data with missing values
A = [1, 2, NaN, 4, NaN, 6, 7, NaN, 9, 10]

% Fill missing values using a moving mean with a window length of 5
filledData = fillmissing(A, 'movmean', 5);

% Display the data after filling missing values
disp('Data after filling missing values using moving mean with window length of 5');
disp(filledData)

A = 1×10
     1     2   NaN     4   NaN     6     7   NaN     9    10

Data after filling missing values using moving mean with window length of 5
    1.0000    2.0000    2.3333    4.0000    5.6667    6.0000    7.0000    8.0000    9.0000   10.0000

Advanced Methods

When simple imputation methods fall short—especially in high-stakes or complex data sets—advanced techniques can help preserve relationships within the data and improve model performance. MATLAB provides a suite of tools and functions to handle missing data more intelligently.

For more complex scenarios, consider these approaches:

Model-based imputation (e.g., regression or decision trees)
Multiple imputation methods (such as MICE)
k-nearest neighbor (KNN) imputation
Expectation-maximization (EM) algorithms
Deep learning–based reconstruction (e.g., autoencoders for time series)
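
As a minimal sketch of the first item, model-based imputation by regression, the example below fits a straight line to the rows where the response is observed and predicts the missing one. It uses only the base MATLAB backslash operator for least squares; the data and variable names are assumptions for illustration, and a real application would validate the model before trusting its predictions.

```matlab
x = [1; 2; 3; 4; 5];
y = [2.5; 3.5; NaN; 5.5; 6.5];

% Fit y = b0 + b1*x on the rows where y is observed
observed = ~isnan(y);
design   = [ones(size(x)), x];
coeffs   = design(observed, :) \ y(observed);

% Predict the missing responses from the fitted line
yFilled = y;
yFilled(~observed) = design(~observed, :) * coeffs;

disp(yFilled)
```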

In MATLAB, ensemble learners such as fitrensemble (regression) or fitcensemble (classification) can be trained on complete rows and then used to estimate missing values from other variables in your data set. For example:

% Sample data with missing response values
X = [1, 2; 2, 3; 3, 4; 4, NaN; 5, 6];  % Predictor variables
y = [2.5; 3.5; NaN; 5.5; 6.5];         % Response variable with missing value

% Impute missing values in predictors (e.g., using column mean)
X_imputed = X;
for col = 1:size(X, 2)
   nanIndices = isnan(X(:, col));
   X_imputed(nanIndices, col) = mean(X(~nanIndices, col), 'omitnan');
end

% Train a bagged regression ensemble on rows where y is observed (after imputing X)
completeCases = ~isnan(y);
model = fitrensemble(X_imputed(completeCases, :), y(completeCases), 'Method', 'Bag', 'NumLearningCycles', 50);

% Predict and fill missing values in y
y(isnan(y)) = predict(model, X_imputed(isnan(y), :));

% Display filled data set
disp(table(X_imputed, y, 'VariableNames', {'Predictors', 'Response'}));

X = 5×2
     1     2
     2     3
     3     4
     4   NaN
     5     6
y = 5×1
    2.5000
    3.5000
    4.3600
    5.5000
    6.5000

    Predictors    Response
    __________    ________
    1       2        2.5 
    2       3        3.5 
    3       4       4.36 
    4    3.75        5.5 
    5       6        6.5

Approaches like this use a machine learning model to predict missing values from related variables. Because bagging is randomized, the predicted value can vary slightly between runs.

Exploring Different Approaches with Data Cleaner App

If you’re looking to experiment with different data cleaning techniques, the Data Cleaner app in MATLAB provides an interactive way to try various methods for handling missing data. This app lets you explore different imputation and cleaning strategies visually, helping you fine-tune your approach and see how each method impacts your data in real time. It’s a great tool for engineers to quickly test out multiple approaches and make informed decisions about which technique works best for their data set.

The same kind of comparison can also be scripted. Here’s a practical example of how to handle missing data with different methods in MATLAB:

% Sample data with missing values
data = [2, NaN, 5, 7; NaN, 3, 4, NaN; 6, 8, NaN, 10];

% Fill missing data using linear interpolation
data_filled_interp = fillmissing(data, 'linear');

% Fill missing data using moving average
data_filled_movmean = fillmissing(data, 'movmean', 2);

% Display filled data
disp('Data filled with Linear Interpolation:');
disp(data_filled_interp);
disp('Data filled with Moving Average:');
disp(data_filled_movmean);

Data filled with Linear Interpolation:
    2.0000   -2.0000    5.0000    7.0000
    4.0000    3.0000    4.0000    8.5000
    6.0000    8.0000    3.0000   10.0000

Data filled with Moving Average:
     2   NaN     5     7
     2     3     4     7
     6     8     4    10

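In this example, fillmissing operates on each column separately. Linear interpolation fills every gap, but the -2 in the first row is an extrapolated value because that gap sits at the start of its column. The moving average with a window length of 2 leaves one NaN in place because its window contains no valid data. Comparing methods side by side like this helps you catch such edge cases before they affect your analysis.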
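
If extrapolated values such as the -2 in the linear interpolation output are not acceptable, one option is the 'EndValues' name-value argument of fillmissing. The sketch below reuses the sample matrix from the example; choosing 'nearest' for the end gaps is an assumption, and other settings may suit your data better.

```matlab
data = [2, NaN, 5, 7; NaN, 3, 4, NaN; 6, 8, NaN, 10];

% Interpolate interior gaps linearly, but copy the nearest valid value
% into gaps at the start or end of a column instead of extrapolating
filledSafe = fillmissing(data, 'linear', 'EndValues', 'nearest');

disp(filledSafe)
```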

Frequently Asked Questions

1. What is the best technique for handling missing data when the pattern of missingness is random?

For data that is missing completely at random (MCAR), simple imputation methods such as replacing missing values with the mean or median often work well. For a quick solution, use the fillmissing function in MATLAB with the 'constant' method and the column means as the fill values:

data_filled_mean = fillmissing(data, 'constant', mean(data, 'omitnan'));

For larger data sets, machine learning–based methods such as fitrensemble can predict missing values based on other variables.

2. How should I handle missing data in sensor or time series data?

For sensor and time series data, interpolation methods such as linear or spline interpolation are often a good fit because neighboring samples are usually correlated. You can use the fillmissing function in MATLAB with the 'linear' option to fill gaps:

data_filled_interp = fillmissing(data, 'linear');

This will fill the NaN values in data by linearly interpolating the missing values based on the surrounding data points.

If gaps are more complex, machine learning models such as fitrensemble can predict missing values based on trends in the data.
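
When samples are not evenly spaced in time, interpolation by position can give misleading values. The sketch below uses the 'SamplePoints' name-value argument of fillmissing so that interpolation follows the actual timestamps; the timestamps and readings are made-up values for illustration.

```matlab
% Irregularly spaced timestamps (seconds) and readings with one gap
t = [0 1 4 5];
v = [0 NaN 8 10];

% Default: samples are treated as evenly spaced
filledByIndex = fillmissing(v, 'linear');

% With sample points: interpolation follows the actual timestamps
filledByTime = fillmissing(v, 'linear', 'SamplePoints', t);

disp(filledByIndex)
disp(filledByTime)
```

The two calls fill the same gap with different values because the gap lies one second after the first sample but three seconds before the next.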

3. What method should I use for missing data when there’s potential bias or underlying structure?

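For biased or structured missing data (MNAR), simple fills can distort results, so consider advanced methods such as machine learning–based techniques. fitrensemble can predict missing values from related variables, which can help reduce bias, although no method fully corrects MNAR data without assumptions about why the values are missing: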

model = fitrensemble(X_train, y_train);
y_pred = predict(model, X_test);

It’s crucial to validate your approach to ensure your imputed data maintains the accuracy and integrity of your analysis, giving you confidence in your results. With the right tools and strategies, you can turn missing data into an opportunity for deeper insights, helping you make more informed engineering decisions.

