knnimpute

Impute missing data using nearest-neighbor method

Description

imputedData = knnimpute(data) returns imputedData after replacing NaNs in the input data with the corresponding value from the nearest-neighbor column. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used. The function calculates the Euclidean distance between observation columns by using only the rows with no NaN values. Thus, the data must have at least one row that contains no NaN.

imputedData = knnimpute(data,k) replaces NaNs in data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.

imputedData = knnimpute(data,k,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, imputedData = knnimpute(data,k,'Distance','mahalanobis') uses the Mahalanobis distance to compute the nearest-neighbor columns.

Examples

The function knnimpute replaces NaNs in the input data with the corresponding value from the nearest-neighbor column. Consider the following matrix.

A = [1 2 5;4 5 7;NaN -1 8;7 6 0]
A = 4×3

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0

A(3,1) is NaN. Because column 2 is the closest column to column 1 by Euclidean distance, knnimpute replaces A(3,1) with the corresponding entry from column 2, A(3,2), which is -1.

results = knnimpute(A)
results = 4×3

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0
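
The choice of column 2 can be checked by hand. The following is a minimal sketch in base MATLAB that does not call knnimpute; it only mirrors the description above (Euclidean distance over the rows with no NaN), and the variable names are illustrative.

```matlab
% Sketch: reproduce the nearest-column choice for A by hand.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Keep only rows with no NaN; distances are computed on these rows.
completeRows = ~any(isnan(A), 2);      % rows 1, 2, and 4
C = A(completeRows, :);

% Euclidean distance from column 1 to columns 2 and 3.
d12 = norm(C(:,1) - C(:,2));           % sqrt(3)
d13 = norm(C(:,1) - C(:,3));           % sqrt(74)

% Pick the closer column and copy its row-3 entry into the gap.
[~, idx] = min([d12 d13]);
nearestCol = idx + 1;                  % idx 1 -> column 2, idx 2 -> column 3
filled = A;
filled(3,1) = A(3, nearestCol);        % -1
```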

The data must have at least one row without any NaN values for knnimpute to work. If all rows have NaN values, you can add a row where every observation (column) has identical values and call knnimpute on the updated matrix to replace the NaN values with the average of all column values for a given row.

B = [NaN 2 1; 3 NaN 1; 1 8 NaN]
B = 3×3

   NaN     2     1
     3   NaN     1
     1     8   NaN

B(4,:) = ones(1,3)
B = 4×3

   NaN     2     1
     3   NaN     1
     1     8   NaN
     1     1     1

imputed = knnimpute(B)
imputed = 4×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
    1.0000    1.0000    1.0000

You can then remove the added row.

imputed(4,:) = []
imputed = 3×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
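
The imputed values in this example coincide with the mean of the non-NaN entries in each row of B. A minimal base-MATLAB check of that arithmetic, which does not call knnimpute (the variable names are illustrative):

```matlab
% Sketch: fill each NaN in B with the mean of the non-NaN values in its row.
B = [NaN 2 1; 3 NaN 1; 1 8 NaN];

rowMeans = mean(B, 2, 'omitnan');              % [1.5; 2; 4.5]
rowMeanMat = repmat(rowMeans, 1, size(B, 2));  % one copy of the mean per column

expected = B;
mask = isnan(B);
expected(mask) = rowMeanMat(mask);             % matches the imputed matrix
```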

Load a sample biological data set and impute missing values in yeastvalues, where each row represents a gene and each column represents an experimental condition or observation.

load yeastdata

Remove data for empty spots where gene labels are set to 'EMPTY'.

emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];

knnimpute uses the next nearest column if the corresponding value from the nearest-neighbor column is also NaN. However, if every value in a row is NaN, the function issues a warning for that row and keeps the row in the returned output rather than deleting it. The sample data contains some rows with all NaNs. Remove those rows to avoid the warnings.

yeastvalues(~any(~isnan(yeastvalues),2),:) = [];
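
The row filter can be tried on a small matrix first. This sketch uses a made-up matrix X and shows that only rows consisting entirely of NaN are removed, while partly missing rows are kept for imputation.

```matlab
% Toy matrix: row 2 is entirely NaN, row 3 is only partly NaN.
X = [1 2 3; NaN NaN NaN; 4 NaN 6];

allNaNRows = all(isnan(X), 2);                               % [false; true; false]
sameAsDoubleNegation = isequal(allNaNRows, ~any(~isnan(X), 2));

X(allNaNRows, :) = [];                                       % drops row 2 only
```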

Impute missing values.

imputedData1 = knnimpute(yeastvalues);

Check whether any NaN values remain after imputation.

sum(any(isnan(imputedData1),2))
ans = 0

Impute missing values using a weighted mean of the 5 nearest-neighbor columns.

imputedData2 = knnimpute(yeastvalues,5);
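
When k is greater than 1, each NaN is replaced by a weighted mean of the k nearest columns, with weights inversely proportional to distance. The sketch below illustrates that arithmetic for a single missing entry in base MATLAB. The neighbor values and distances are made up, and normalizing the weights to sum to 1 is an assumption about the details; a zero distance is not handled here.

```matlab
% Hypothetical values of the k = 3 nearest columns in the row being imputed,
% and hypothetical distances from the target column to those columns.
neighborValues = [2 4 6];
neighborDist   = [1 2 4];

% Weights inversely proportional to distance, normalized to sum to 1.
w = 1 ./ neighborDist;
w = w / sum(w);                            % [4 2 1] / 7

imputedValue = sum(w .* neighborValues);   % 22/7, about 3.1429
```

The closest neighbor (distance 1) contributes the most, so the result lies nearer to 2 than the plain average of 4 would.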

Change the distance metric to use the Minkowski distance.

imputedData3 = knnimpute(yeastvalues,5,'Distance','minkowski');

You can also specify the parameter for the distance metric. For instance, specify a different exponent (say 5) for the Minkowski distance.

imputedData4 = knnimpute(yeastvalues,5,'Distance','minkowski','DistArgs',5);
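
To see what the exponent changes, the following base-MATLAB sketch evaluates the Minkowski distance between two made-up vectors for several exponents. The helper minkowskiDist is illustrative and is not part of knnimpute.

```matlab
% Minkowski distance with exponent p between two vectors (illustrative helper).
minkowskiDist = @(u, v, p) sum(abs(u - v).^p)^(1/p);

u = [1 4 7];
v = [5 7 0];

d1 = minkowskiDist(u, v, 1);   % 14, same as the city block distance
d2 = minkowskiDist(u, v, 2);   % sqrt(74), same as the Euclidean distance
d5 = minkowskiDist(u, v, 5);   % closer to the largest coordinate difference, 7
```

As the exponent grows, the distance approaches the maximum coordinate difference, which is the Chebychev distance.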

Input Arguments

Input data, specified as a matrix. The data must have at least one row that contains no NaN because the function calculates the Euclidean distance between observation columns by using only the rows with no NaN values.

Data Types: double
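
Before calling knnimpute, it can be useful to confirm that the input has at least one row with no NaN. A minimal base-MATLAB sketch, where hasCompleteRow is a hypothetical helper name:

```matlab
% Hypothetical helper: true if at least one row of data has no NaN.
hasCompleteRow = @(data) any(all(~isnan(data), 2));

A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];
B = [NaN 2 1; 3 NaN 1; 1 8 NaN];

okA = hasCompleteRow(A);   % true: rows 1, 2, and 4 are complete
okB = hasCompleteRow(B);   % false: every row has a NaN
```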

Number of nearest neighbors, specified as a positive integer.

Data Types: double

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: imputedData = knnimpute(data,k,'Distance','mahalanobis')

Distance metric, specified as a character vector, string, or function handle, as described in the following table.

Use the 'DistArgs' name-value pair together with 'Distance' to specify parameters for the distance function. For instance, to specify a different exponent (say 5) for the Minkowski distance, use: output = knnimpute(data,3,'Distance','minkowski','DistArgs',5).

Value    Description
'euclidean'

Euclidean distance (default).

'squaredeuclidean'

Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)

'seuclidean'

Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, S = nanstd(X). Use 'DistArgs' to specify another value for S.

'mahalanobis'

Mahalanobis distance using the sample covariance of X, C = nancov(X). Use 'DistArgs' to specify another value for C, where the matrix C is symmetric and positive definite.

'cityblock'

City block distance.

'minkowski'

Minkowski distance. The default exponent is 2. Use 'DistArgs' to specify a different exponent P, where P is a positive scalar.

'chebychev'

Chebychev distance (maximum coordinate difference).

'cosine'

One minus the cosine of the included angle between points (treated as vectors).

'correlation'

One minus the sample correlation between points (treated as sequences of values).

'hamming'

Hamming distance, which is the percentage of coordinates that differ.

'jaccard'

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

'spearman'

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

@distfun

Custom distance function handle. A distance function has the form

function D2 = distfun(ZI,ZJ)
% calculation of distance
...
where

  • ZI is a 1-by-n vector containing a single observation.

  • ZJ is an m2-by-n matrix containing multiple observations. distfun must accept a matrix ZJ with an arbitrary number of observations.

  • D2 is an m2-by-1 vector of distances, and D2(k) is the distance between observations ZI and ZJ(k,:).

If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.

See pdist for more details.
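
As an illustration of the required form, the sketch below defines a city block distance as a function handle and checks the output shape on made-up inputs. It runs in base MATLAB and does not call knnimpute; the handle body is only one possible choice.

```matlab
% Hypothetical custom distance: city block, written to match the required form.
% ZI is 1-by-n, ZJ is m2-by-n, and the result is m2-by-1.
distfun = @(ZI, ZJ) sum(abs(bsxfun(@minus, ZJ, ZI)), 2);

ZI = [1 2 3];
ZJ = [1 2 3; 2 4 6; 0 0 0];
D2 = distfun(ZI, ZJ);      % [0; 6; 6]
```

A handle of this form could then be supplied as the 'Distance' value, for example knnimpute(data,k,'Distance',distfun). How NaN entries reach the handle is not covered by this sketch.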

Example: 'Distance','cosine'

Data Types: char | string | function_handle

Distance metric parameter values, specified as a positive scalar or cell array of values. Use 'DistArgs' together with 'Distance' to specify parameters for the distance function. For instance, to specify a different exponent (say 5) for the Minkowski distance, use: output = knnimpute(data,3,'Distance','minkowski','DistArgs',5).

Example: 'DistArgs',3

Data Types: double | cell

Weights used in the weighted mean calculation, specified as a numeric vector of length k.

Example: 'Weights',[0.3 0.5 0.2]

Data Types: double

Flag to use the median of k nearest neighbors instead of the weighted mean, specified as true or false.

Example: 'Median',true

Data Types: logical
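
The median can make an imputed value less sensitive to an outlying neighbor. The sketch below compares the two summaries for one missing entry, using made-up neighbor values. For simplicity it uses an unweighted mean, which is an assumption that departs from the weighted mean knnimpute uses.

```matlab
% Hypothetical neighbor values for one missing entry; the last is an outlier.
neighborValues = [2 4 60];

meanEstimate   = mean(neighborValues);     % 22, pulled toward the outlier
medianEstimate = median(neighborValues);   % 4
```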

Output Arguments

Results after replacing NaNs from the input data with the corresponding value from the nearest-neighbor column, returned as a numeric matrix.

Introduced before R2006a