knnimpute

Impute missing data using nearest-neighbor method

Syntax

imputedData = knnimpute(data)

imputedData = knnimpute(data,k)

imputedData = knnimpute(data,k,Name,Value)

Description

imputedData = knnimpute(data) returns imputedData after replacing NaNs in the input data with the corresponding value from the nearest-neighbor column. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used. The function calculates the Euclidean distance between observation columns by using only the rows with no NaN values. Thus, the data must have at least one row that contains no NaN.

imputedData = knnimpute(data,k) replaces NaNs in data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.

imputedData = knnimpute(data,k,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, imputedData = knnimpute(data,k,'Distance','mahalanobis') uses the Mahalanobis distance to compute the nearest-neighbor columns.

Examples

Impute Missing Data Using KNN

The function knnimpute replaces NaNs in the input data with the corresponding value from the nearest-neighbor column. Consider the following matrix.

A = [1 2 5;4 5 7;NaN -1 8;7 6 0]

A = 4×3

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0

A(3,1) is NaN, and because column 2 is the closest column to column 1 in the Euclidean distance, knnimpute replaces the (3,1) entry of column 1 with the corresponding entry from column 2, which is -1.

results = knnimpute(A)

results = 4×3

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0
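
The choice of column 2 can be checked by hand in base MATLAB. The following is a minimal sketch that mirrors the description above (Euclidean distance over the rows with no NaN); it does not call knnimpute, and the variable names are illustrative only.

```matlab
% Sketch: reproduce the nearest-column choice for A(3,1) by hand.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Keep only the rows that contain no NaN (rows 1, 2, and 4 here).
completeRows = ~any(isnan(A), 2);
C = A(completeRows, :);

% Euclidean distance from column 1 to columns 2 and 3.
d12 = norm(C(:,1) - C(:,2));   % sqrt(1 + 1 + 1)
d13 = norm(C(:,1) - C(:,3));   % sqrt(16 + 9 + 49)

% Pick the closer column and read its row-3 entry.
candidates = [2 3];
[~, idx] = min([d12 d13]);
nearestCol = candidates(idx);
filled = A(3, nearestCol);
```

Here nearestCol is 2 and filled is -1, which agrees with the (3,1) entry of results.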

The data must have at least one row without any NaN values for knnimpute to work. If every row has a NaN value, you can add a row in which every observation (column) has the same value, and then call knnimpute on the updated matrix. Each NaN is then replaced with the average of the non-missing values in its row.

B = [NaN 2 1; 3 NaN 1; 1 8 NaN]

B = 3×3

   NaN     2     1
     3   NaN     1
     1     8   NaN

B(4,:) = ones(1,3)

B = 4×3

   NaN     2     1
     3   NaN     1
     1     8   NaN
     1     1     1

imputed = knnimpute(B)

imputed = 4×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
    1.0000    1.0000    1.0000

You can then remove the added row.

imputed(4,:) = []

imputed = 3×3

    1.5000    2.0000    1.0000
    3.0000    2.0000    1.0000
    1.0000    8.0000    4.5000
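
The imputed values in this example can be cross-checked against the row means of the non-missing entries. The sketch below uses only base MATLAB and does not call knnimpute; the variable names rowMeans and expected are illustrative.

```matlab
% Sketch: fill each NaN in B with the mean of the non-NaN values in its row.
B = [NaN 2 1; 3 NaN 1; 1 8 NaN];

% Row-wise mean that ignores NaN entries.
rowMeans = mean(B, 2, 'omitnan');

% Copy B and overwrite only the missing entries.
expected = B;
for r = 1:size(B, 1)
    expected(r, isnan(B(r, :))) = rowMeans(r);
end
```

The matrix expected matches imputed after the added row is removed.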

Load a sample biological data set and impute the missing values in yeastvalues, where each row represents a gene and each column represents an experimental condition or observation.

load yeastdata

Remove data for empty spots where gene labels are set to 'EMPTY'.

emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];

knnimpute uses the next nearest column if the corresponding value from the nearest-neighbor column is also NaN. However, if every value in a row is NaN, the function generates a warning for that row and keeps the row in the returned output instead of deleting it. The sample data contains some rows with all NaNs. Remove those rows to avoid the warnings.

yeastvalues(~any(~isnan(yeastvalues),2),:) = [];
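
The row mask in the previous command selects rows in which no entry is non-NaN, that is, rows made up entirely of NaNs. As a small sketch on a made-up matrix X (not the yeast data), the same mask can be written more directly with all:

```matlab
% Sketch: two equivalent ways to flag rows that contain only NaNs.
X = [1 NaN 3; NaN NaN NaN; 4 5 NaN];

maskOriginal = ~any(~isnan(X), 2);   % form used in the example
maskDirect   = all(isnan(X), 2);     % equivalent, easier to read

% Delete the all-NaN rows.
X(maskDirect, :) = [];
```

Only the second row of X is removed; rows that are partly NaN are kept for imputation.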

Impute missing values.

imputedData1 = knnimpute(yeastvalues);

Check whether any NaNs remain after imputing the data.

sum(any(isnan(imputedData1),2))

ans = 0

Impute the missing values using the five nearest-neighbor columns.

imputedData2 = knnimpute(yeastvalues,5);
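
When k is greater than 1, each NaN is replaced with a weighted mean whose weights are inversely proportional to the column distances. The arithmetic for a single missing entry can be sketched as follows. The neighbor values and distances are made-up numbers, and normalizing the weights to sum to 1 is an assumption of this sketch rather than a documented detail of knnimpute.

```matlab
% Sketch with made-up numbers: one missing entry, k = 3 neighbors.
neighborValues = [2.0 4.0 10.0];   % entries from the 3 nearest columns
distances      = [1.0 2.0 4.0];    % distances to those columns

% Weights inversely proportional to distance, normalized to sum to 1
% (the normalization is an assumption of this sketch).
w = 1 ./ distances;
w = w / sum(w);

% Weighted mean of the neighbor values.
imputedValue = sum(w .* neighborValues);
```

The closest neighbor contributes the most, so imputedValue (26/7, about 3.71) lies nearer to 2 than to 10.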

Change the distance metric to the Minkowski distance.

imputedData3 = knnimpute(yeastvalues,5,'Distance','minkowski');

You can also specify a parameter for the distance metric. For instance, specify a different exponent (say 5) for the Minkowski distance.

imputedData4 = knnimpute(yeastvalues,5,'Distance','minkowski','DistArgs',5);
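
To see what the exponent changes, the Minkowski distance between two vectors can be computed directly. This is a minimal base MATLAB sketch of the standard formula; the helper name minkowski and the vectors x and y are illustrative and are not part of knnimpute.

```matlab
% Sketch: Minkowski distance for several exponents p.
x = [1 4 7];
y = [2 5 6];

% Standard definition: (sum of |x_i - y_i|^p)^(1/p).
minkowski = @(a, b, p) sum(abs(a - b).^p)^(1/p);

d1 = minkowski(x, y, 1);   % p = 1 is the city block distance
d2 = minkowski(x, y, 2);   % p = 2 is the Euclidean distance
d5 = minkowski(x, y, 5);   % larger p emphasizes the largest difference
```

With p = 2 the result equals norm(x - y), which is why the default exponent reproduces the Euclidean distance.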

Input Arguments

`data` — Input data
matrix

Input data, specified as a matrix. The data must have at least one row that contains no NaN because the function calculates the Euclidean distance between observation columns by using only the rows with no NaN values.

Data Types: double

`k` — Number of nearest neighbors
positive integer

Number of nearest neighbors, specified as a positive integer.

Data Types: double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: imputedData = knnimpute(data,k,'Distance','mahalanobis')

`Distance` — Distance metric
character vector | string | function handle

Distance metric, specified as a character vector, string, or function handle, as described in the following table.

Use the 'DistArgs' name-value argument together with 'Distance' to specify parameters for the distance function. For instance, to specify a different exponent (say 5) for the Minkowski distance, use: output = knnimpute(data,3,'Distance','minkowski','DistArgs',5).

Value	Description
`'euclidean'`	Euclidean distance (default).
`'squaredeuclidean'`	Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)
`'seuclidean'`	Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, `S = nanstd(X)`. Use `'DistArgs'` to specify another value for `S`.
`'mahalanobis'`	Mahalanobis distance using the sample covariance of `X`, `C = nancov(X)`. Use `'DistArgs'` to specify another value for `C`, where the matrix `C` is symmetric and positive definite.
`'cityblock'`	City block distance.
`'minkowski'`	Minkowski distance. The default exponent is 2. Use `'DistArgs'` to specify a different exponent `P`, where `P` is a positive scalar.
`'chebychev'`	Chebychev distance (maximum coordinate difference).
`'cosine'`	One minus the cosine of the included angle between points (treated as vectors).
`'correlation'`	One minus the sample correlation between points (treated as sequences of values).
`'hamming'`	Hamming distance, which is the percentage of coordinates that differ.
`'jaccard'`	One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.
`'spearman'`	One minus the sample Spearman's rank correlation between observations (treated as sequences of values).
`@distfun`	Custom distance function handle. A distance function has the form `function D2 = distfun(ZI,ZJ)`, where `ZI` is a `1`-by-`n` vector containing a single observation and `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations. `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`. If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.

See pdist for more details.
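
As an illustration of the `@distfun` form described in the table, the following sketch writes the Euclidean distance as an anonymous function and checks the shape of its output. The handle and the test inputs are hypothetical, chosen only to show the required signature.

```matlab
% Hypothetical custom distance: Euclidean distance written as a handle.
% ZI is 1-by-n, ZJ is m2-by-n, and the result is m2-by-1.
% (ZJ - ZI relies on implicit expansion, available in R2016b and later.)
distfun = @(ZI, ZJ) sqrt(sum((ZJ - ZI).^2, 2));

ZI = [0 0];                 % a single observation
ZJ = [3 4; 6 8; 0 0];       % three observations
D2 = distfun(ZI, ZJ);       % one distance per row of ZJ
```

A handle of this shape would then be passed as the 'Distance' value, for example knnimpute(data,k,'Distance',distfun). A built-in metric such as 'euclidean' is generally the faster choice when one fits.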

Example: 'Distance','cosine'

Data Types: char | string | function_handle

`DistArgs` — Distance metric parameter values
positive scalar | cell array

Distance metric parameter values, specified as a positive scalar or cell array of values. Use 'DistArgs' together with 'Distance' to specify parameters for the distance function. For instance, to specify a different exponent (say 5) for the Minkowski distance, use: output = knnimpute(data,3,'Distance','minkowski','DistArgs',5).

Example: 'DistArgs',3

Data Types: double | cell

`Weights` — Weights used in weighted mean calculation
numeric vector of length `k`

Weights used in the weighted mean calculation, specified as a numeric vector of length k.

Example: 'Weights',[0.3 0.5 0.2]

Data Types: double
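
The effect of a user-supplied weight vector on one missing entry can be sketched as a plain weighted mean. The neighbor values are made up, and both the pairing of weights with neighbors (nearest first) and the division by the sum of the weights are assumptions of this sketch, not documented behavior.

```matlab
% Sketch with made-up numbers: weighted mean using user-supplied weights.
neighborValues = [2.0 4.0 10.0];   % entries from the k = 3 nearest columns
w = [0.3 0.5 0.2];                 % one weight per neighbor (assumed order)

% Dividing by sum(w) makes the result independent of the scale of w.
imputedValue = sum(w .* neighborValues) / sum(w);
```

With these numbers the second neighbor dominates and imputedValue is 4.6.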

`Median` — Flag to use median of `k` nearest neighbors
`true` | `false`

Flag to use the median of k nearest neighbors instead of the weighted mean, specified as true or false.

Example: 'Median',true

Data Types: logical
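
The median can be preferable when one of the k neighbors holds an extreme value. The following sketch compares the two summaries on made-up neighbor values with equal distances; it illustrates the arithmetic only and does not call knnimpute.

```matlab
% Sketch with made-up numbers: median versus weighted mean of k = 3 neighbors.
neighborValues = [2.0 3.0 50.0];   % the third neighbor is an outlier
distances      = [1.0 1.0 1.0];    % equal distances give equal weights

% Inverse-distance weights, normalized to sum to 1 (assumed normalization).
w = (1 ./ distances) / sum(1 ./ distances);

weightedMean = sum(w .* neighborValues);   % pulled toward the outlier
medianValue  = median(neighborValues);     % unaffected by the outlier
```

Here weightedMean is 55/3 (about 18.33), while medianValue is 3.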

Output Arguments

`imputedData` — Results after replacing NaNs
numeric matrix

Results after replacing NaNs from the input data with the corresponding value from the nearest-neighbor column, returned as a numeric matrix.

Version History

Introduced before R2006a
