# predict

Predict labels using k-nearest neighbor classification model

## Syntax

`label = predict(mdl,X)`

`[label,score,cost] = predict(mdl,X)`

## Description

`label = predict(mdl,X)` returns a vector of predicted class labels for the predictor data in the table or matrix `X`, based on the trained k-nearest neighbor classification model `mdl`. See Predicted Class Label.


`[label,score,cost] = predict(mdl,X)` also returns:

• A matrix of classification scores (`score`) indicating the likelihood that a label comes from a particular class. For k-nearest neighbor, scores are posterior probabilities. See Posterior Probability.

• A matrix of expected classification costs (`cost`). For each observation in `X`, the predicted class label corresponds to the minimum expected classification cost among all classes. See Expected Cost.

## Examples


Create a k-nearest neighbor classifier for Fisher's iris data, where k = 5. Evaluate some model predictions on new data.

Load the Fisher iris data set.

```
load fisheriris
X = meas;
Y = species;
```

Create a classifier for five nearest neighbors. Standardize the noncategorical predictor data.

`mdl = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1);`

Predict the classifications for flowers with minimum, mean, and maximum characteristics.

```
Xnew = [min(X);mean(X);max(X)];
[label,score,cost] = predict(mdl,Xnew)
```

```
label = 3x1 cell
    {'versicolor'}
    {'versicolor'}
    {'virginica' }
```

```
score = 3×3
    0.4000    0.6000         0
         0    1.0000         0
         0         0    1.0000
```

```
cost = 3×3
    0.6000    0.4000    1.0000
    1.0000         0    1.0000
    1.0000    1.0000         0
```

The second and third rows of the score and cost matrices have binary values, which means all five nearest neighbors of the mean and maximum flower measurements have identical classifications.
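As a check, you can inspect those five nearest neighbors directly. The following is a minimal sketch, assuming the workspace from this example; it standardizes by hand with `mdl.Mu` and `mdl.Sigma` because the model was trained with `'Standardize',1`, and then queries the standardized training data with `knnsearch`.

```
% Standardize the training data and the mean-flower query point the same
% way predict does internally.
Xs = (X - mdl.Mu) ./ mdl.Sigma;
xq = (mean(X) - mdl.Mu) ./ mdl.Sigma;

% Find the five nearest neighbors and inspect their classes.
idx = knnsearch(Xs,xq,'K',mdl.NumNeighbors);
Y(idx)   % all five neighbors should be 'versicolor'
```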

## Input Arguments


`mdl` — k-nearest neighbor classifier model, specified as a `ClassificationKNN` object (for example, as returned by `fitcknn`).

`X` — Predictor data to be classified, specified as a numeric matrix or table.

Each row of `X` corresponds to one observation, and each column corresponds to one variable.

• For a numeric matrix:

• The variables that make up the columns of `X` must have the same order as the predictor variables used to train `mdl`.

• If you train `mdl` using a table (for example, `Tbl`), then `X` can be a numeric matrix if `Tbl` contains all numeric predictor variables. k-nearest neighbor classification requires homogeneous predictors. Therefore, to treat all numeric predictors in `Tbl` as categorical during training, set `'CategoricalPredictors','all'` when you train using `fitcknn`. If `Tbl` contains heterogeneous predictors (for example, numeric and categorical data types) and `X` is a numeric matrix, then `predict` throws an error.

• For a table:

• `predict` does not support multicolumn variables and cell arrays other than cell arrays of character vectors.

• If you train `mdl` using a table (for example, `Tbl`), then all predictor variables in `X` must have the same variable names and data types as those used to train `mdl` (stored in `mdl.PredictorNames`). However, the column order of `X` does not need to correspond to the column order of `Tbl`. Both `Tbl` and `X` can contain additional variables (response variables, observation weights, and so on), but `predict` ignores them.

• If you train `mdl` using a numeric matrix, then the predictor names in `mdl.PredictorNames` and corresponding predictor variable names in `X` must be the same. To specify predictor names during training, see the `PredictorNames` name-value pair argument of `fitcknn`. All predictor variables in `X` must be numeric vectors. `X` can contain additional variables (response variables, observation weights, and so on), but `predict` ignores them.
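To illustrate the table case, here is a minimal sketch of training on a table and then predicting on a table whose columns are in a different order; the predictor names `SL`, `SW`, `PL`, and `PW` are hypothetical, and `predict` matches the variables by name.

```
load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
Tbl.Species = species;
mdlT = fitcknn(Tbl,'Species','NumNeighbors',5);

% The new table reorders the columns; predict matches them by name.
Tnew = array2table(meas(1:3,[4 3 2 1]),'VariableNames',{'PW','PL','SW','SL'});
labelT = predict(mdlT,Tnew)
```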

If you set `'Standardize',true` in `fitcknn` to train `mdl`, then the software standardizes the columns of `X` using the corresponding means in `mdl.Mu` and standard deviations in `mdl.Sigma`.
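For instance, a minimal sketch of that standardization, reusing `Xnew` and `mdl` from the Fisher iris example:

```
% The same transform predict applies internally when 'Standardize',true
% was used at training time.
XnewStd = (Xnew - mdl.Mu) ./ mdl.Sigma;
```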

Data Types: `double` | `single` | `table`

## Output Arguments

collapse all

`label` — Predicted class labels for the observations (rows) in `X`, returned as a categorical array, character array, logical vector, vector of numeric values, or cell array of character vectors. `label` has length equal to the number of rows in `X`. The label is the class with minimal expected cost. See Predicted Class Label.

`score` — Predicted class scores or posterior probabilities, returned as a numeric matrix of size n-by-K, where n is the number of observations (rows) in `X` and K is the number of classes (in `mdl.ClassNames`). `score(i,j)` is the posterior probability that observation `i` in `X` belongs to class `j` in `mdl.ClassNames`. See Posterior Probability.

Data Types: `single` | `double`

`cost` — Expected classification costs, returned as a numeric matrix of size n-by-K, where n is the number of observations (rows) in `X` and K is the number of classes (in `mdl.ClassNames`). `cost(i,j)` is the expected cost of classifying observation `i` in `X` as class `j` in `mdl.ClassNames`. See Expected Cost.

Data Types: `single` | `double`

## Algorithms


### Predicted Class Label

`predict` classifies by minimizing the expected misclassification cost:

$$\hat{y} = \underset{y=1,\dots,K}{\arg\min}\, \sum_{j=1}^{K} \hat{P}(j \mid x)\, C(y \mid j),$$

where:

• $\hat{y}$ is the predicted classification.

• $K$ is the number of classes.

• $\hat{P}(j \mid x)$ is the posterior probability of class $j$ for observation $x$.

• $C(y \mid j)$ is the cost of classifying an observation as $y$ when its true class is $j$.
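Continuing the Fisher iris example, the decision rule can be checked by hand against the `cost` output of `predict`; this is a minimal sketch under the default cost matrix.

```
% The predicted class minimizes the expected cost, so it is the column
% index of the smallest entry in each row of cost.
[~,idxMin] = min(cost,[],2);
labelCheck = mdl.ClassNames(idxMin)   % matches the label output
```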

### Posterior Probability

Consider a vector (single query point) `xnew` and a model `mdl`.

• k is the number of nearest neighbors used in prediction, `mdl.NumNeighbors`.

• `nbd(mdl,xnew)` specifies the k nearest neighbors to `xnew` in `mdl.X`.

• `Y(nbd)` specifies the classifications of the points in `nbd(mdl,xnew)`, namely `mdl.Y(nbd)`.

• `W(nbd)` specifies the weights of the points in `nbd(mdl,xnew)`.

• `prior` specifies the priors of the classes in `mdl.Y`.

If the model contains a vector of prior probabilities, then the observation weights `W` are normalized by class to sum to the priors. This process might involve a calculation for the point `xnew`, because weights can depend on the distance from `xnew` to the points in `mdl.X`.

The posterior probability $p(j \mid x_{\mathrm{new}})$ is

$$p(j \mid x_{\mathrm{new}}) = \frac{\sum_{i \in \mathrm{nbd}} W(i)\, \mathbf{1}_{\{Y(X(i)) = j\}}}{\sum_{i \in \mathrm{nbd}} W(i)}.$$

Here, $\mathbf{1}_{\{Y(X(i)) = j\}}$ is `1` when `mdl.Y(i) = j`, and `0` otherwise.
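With the default equal observation weights ($W(i) = 1$ for every neighbor), the posterior for class $j$ reduces to the fraction of the k neighbors belonging to class $j$. A minimal sketch, reusing the neighbor indices `idx` from the `knnsearch` check in the example above:

```
% Fraction of the k nearest neighbors in each class.
K = numel(mdl.ClassNames);
posterior = zeros(1,K);
for j = 1:K
    posterior(j) = mean(strcmp(Y(idx),mdl.ClassNames{j}));
end
posterior   % matches score(2,:), the row for the mean flower
```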

### True Misclassification Cost

Two costs are associated with KNN classification: the true misclassification cost per class and the expected misclassification cost per observation.

You can set the true misclassification cost per class by using the `'Cost'` name-value pair argument when you run `fitcknn`. The value `Cost(i,j)` is the cost of classifying an observation into class `j` if its true class is `i`. By default, `Cost(i,j) = 1` if `i ~= j`, and `Cost(i,j) = 0` if `i = j`. In other words, the cost is `0` for correct classification and `1` for incorrect classification.
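For example, here is a sketch of a custom cost matrix that makes one kind of error more expensive than the others; the penalty value `5` is hypothetical.

```
% Rows are true classes, columns are predicted classes. Make classifying a
% true 'virginica' (row 3) as 'versicolor' (column 2) five times as costly.
C = ones(3) - eye(3);
C(3,2) = 5;
mdlCost = fitcknn(X,Y,'NumNeighbors',5,'Standardize',1,'Cost',C);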

### Expected Cost

The third output of `predict` is the expected misclassification cost per observation.

Suppose you have `Nobs` observations that you want to classify with a trained classifier `mdl`, and you have `K` classes. You place the observations into a matrix `Xnew` with one observation per row. The command

`[label,score,cost] = predict(mdl,Xnew)`

returns a matrix `cost` of size `Nobs`-by-`K`, among other outputs. Each row of the `cost` matrix contains the expected (average) cost of classifying the observation into each of the `K` classes. `cost(n,j)` is

$$\sum_{i=1}^{K} \hat{P}\bigl(i \mid Xnew(n)\bigr)\, C(j \mid i),$$

where

• $K$ is the number of classes.

• $\hat{P}(i \mid Xnew(n))$ is the posterior probability of class $i$ for observation $Xnew(n)$.

• $C(j \mid i)$ is the true misclassification cost of classifying an observation as $j$ when its true class is $i$.
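Because $C(j \mid i)$ is stored as `mdl.Cost(i,j)`, this sum is a single matrix product of the posterior scores with the stored cost matrix. A minimal sketch, continuing the Fisher iris example:

```
% cost(n,j) = sum_i P(i|Xnew(n)) * C(j|i), with C(j|i) = mdl.Cost(i,j).
costCheck = score * mdl.Cost   % matches the cost output of predict
```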