# testckfold

Compare accuracies of two classification models by repeated cross-validation

## Syntax

``h = testckfold(C1,C2,X1,X2)``
``h = testckfold(C1,C2,X1,X2,Y)``
``h = testckfold(___,Name,Value)``
``[h,p,e1,e2] = testckfold(___)``

## Description

`testckfold` statistically assesses the accuracies of two classification models by repeatedly cross-validating the two models, determining the differences in the classification loss, and then formulating the test statistic by combining the classification loss differences. This type of test is particularly appropriate when sample size is limited.

You can assess whether the accuracies of the classification models are different, or whether one classification model performs better than another. Available tests include a 5-by-2 paired t test, a 5-by-2 paired F test, and a 10-by-10 repeated cross-validation t test. For more details, see Repeated Cross-Validation Tests. To speed up computations, `testckfold` supports parallel computing (requires a Parallel Computing Toolbox™ license).


`h = testckfold(C1,C2,X1,X2)` returns the test decision that results from conducting a 5-by-2 paired F cross-validation test. The null hypothesis is that the classification models `C1` and `C2` have equal accuracy in predicting the true class labels using the predictor and response data in the tables `X1` and `X2`. `h` = `1` indicates to reject the null hypothesis at the 5% significance level.

`testckfold` conducts the cross-validation test by applying `C1` and `C2` to all predictor variables in `X1` and `X2`, respectively. The true class labels in `X1` and `X2` must be the same. The response variable names in `X1`, `X2`, `C1.ResponseName`, and `C2.ResponseName` must be the same.

For examples of ways to compare models, see Tips.


`h = testckfold(C1,C2,X1,X2,Y)` applies the full classification model or classification templates `C1` and `C2` to all predictor variables in the tables or matrices of data `X1` and `X2`, respectively. `Y` is the table variable name corresponding to the true class labels, or an array of true class labels.


`h = testckfold(___,Name,Value)` uses any of the input arguments in the previous syntaxes and additional options specified by one or more `Name,Value` pair arguments. For example, you can specify the type of alternative hypothesis, the type of test, or the use of parallel computing.


`[h,p,e1,e2] = testckfold(___)` also returns the p-value for the hypothesis test (`p`) and the respective classification losses for each cross-validation run and fold (`e1` and `e2`).

## Examples


At each node, `fitctree` chooses the best predictor to split using an exhaustive search by default. Alternatively, you can choose to split the predictor that shows the most evidence of dependence with the response by conducting curvature tests. This example statistically compares classification trees grown via exhaustive search for the best splits and grown by conducting curvature tests with interaction.

Load the `census1994` data set.

```
load census1994.mat
rng(1) % For reproducibility
```

Grow a default classification tree using the training set, `adultdata`, which is a table. The response-variable name is `'salary'`.

`C1 = fitctree(adultdata,'salary')`
```
C1 = 
  ClassificationTree
           PredictorNames: {1x14 cell}
             ResponseName: 'salary'
    CategoricalPredictors: [2 4 6 7 8 9 10 14]
               ClassNames: [<=50K    >50K]
           ScoreTransform: 'none'
          NumObservations: 32561

  Properties, Methods
```

`C1` is a full `ClassificationTree` model. Its `ResponseName` property is `'salary'`. `C1` uses an exhaustive search to find the best predictor to split on based on maximal splitting gain.

Grow another classification tree using the same data set, but specify to find the best predictor to split using the curvature test with interaction.

`C2 = fitctree(adultdata,'salary','PredictorSelection','interaction-curvature')`
```
C2 = 
  ClassificationTree
           PredictorNames: {1x14 cell}
             ResponseName: 'salary'
    CategoricalPredictors: [2 4 6 7 8 9 10 14]
               ClassNames: [<=50K    >50K]
           ScoreTransform: 'none'
          NumObservations: 32561

  Properties, Methods
```

`C2` also is a full `ClassificationTree` model with `ResponseName` equal to `'salary'`.

Conduct a 5-by-2 paired F test to compare the accuracies of the two models using the training set. Because the response-variable names in the data sets and the `ResponseName` properties are all equal, and the response data in both sets are equal, you can omit supplying the response data.

`h = testckfold(C1,C2,adultdata,adultdata)`
```
h = logical
   0
```

`h = 0` indicates to not reject the null hypothesis that `C1` and `C2` have the same accuracies at the 5% significance level.

Conduct a statistical test comparing the misclassification rates of the two models using a 5-by-2 paired F test.

Load Fisher's iris data set.

`load fisheriris;`

Create a naive Bayes template and a classification tree template using default options.

```
C1 = templateNaiveBayes;
C2 = templateTree;
```

`C1` and `C2` are template objects corresponding to the naive Bayes and classification tree algorithms, respectively.

Test whether the two models have equal predictive accuracies. Use the same predictor data for each model. `testckfold` conducts a 5-by-2, two-sided, paired F test by default.

```
rng(1); % For reproducibility
h = testckfold(C1,C2,meas,meas,species)
```
```
h = logical
   0
```

`h = 0` indicates to not reject the null hypothesis that the two models have equal predictive accuracies.

Conduct a statistical test to assess whether a simpler model has better accuracy than a more complex model using a 10-by-10 repeated cross-validation t test.

Load Fisher's iris data set. Create a cost matrix that penalizes misclassifying a setosa iris twice as much as misclassifying a virginica iris as a versicolor.

```
load fisheriris;
tabulate(species)
```
```
       Value    Count   Percent
      setosa       50     33.33%
  versicolor       50     33.33%
   virginica       50     33.33%
```
```
Cost = [0 2 2;2 0 1;2 1 0];
ClassNames = {'setosa' 'versicolor' 'virginica'}; % Specifies the order of the rows and columns in Cost
```

The empirical distribution of the classes is uniform, and the classification cost is slightly imbalanced.

Create two ECOC templates: one that uses linear SVM binary learners and one that uses SVM binary learners equipped with the RBF kernel.

```
tSVMLinear = templateSVM('Standardize',true); % Linear SVM by default
tSVMRBF = templateSVM('KernelFunction','RBF','Standardize',true);
C1 = templateECOC('Learners',tSVMLinear);
C2 = templateECOC('Learners',tSVMRBF);
```

`C1` and `C2` are ECOC template objects. `C1` is prepared for linear SVM training. `C2` is prepared for SVM training with an RBF kernel.

Test the null hypothesis that the simpler model (`C1`) is at most as accurate as the more complex model (`C2`) in terms of classification costs. Conduct the 10-by-10 repeated cross-validation test. Request to return p-values and misclassification costs.

```
rng(1); % For reproducibility
[h,p,e1,e2] = testckfold(C1,C2,meas,meas,species,...
    'Alternative','greater','Test','10x10t','Cost',Cost,...
    'ClassNames',ClassNames)
```
```
h = logical
   0
```
```
p = 0.1077
```
```
e1 = 10×10

         0         0         0    0.0667         0    0.0667    0.1333         0    0.1333         0
    0.0667    0.0667         0         0         0         0    0.0667         0    0.0667    0.0667
         0         0         0         0         0    0.0667    0.0667    0.0667    0.0667    0.0667
    0.0667    0.0667         0    0.0667         0    0.0667         0         0    0.0667         0
    0.0667    0.0667    0.0667         0    0.0667    0.0667         0         0         0         0
         0         0    0.1333         0         0    0.0667         0         0    0.0667    0.0667
    0.0667    0.0667         0         0    0.0667         0         0    0.0667         0    0.0667
    0.0667         0    0.0667    0.0667         0    0.1333         0    0.0667         0         0
         0    0.0667    0.1333    0.0667    0.0667         0         0         0         0         0
         0    0.0667    0.0667    0.0667    0.0667         0         0    0.0667         0         0
```
```
e2 = 10×10

         0         0         0    0.1333         0    0.0667    0.1333         0    0.2667         0
    0.0667    0.0667         0    0.1333         0         0         0    0.1333    0.1333    0.0667
    0.1333    0.1333         0         0         0    0.0667         0    0.0667    0.0667    0.0667
         0    0.1333         0    0.0667    0.1333    0.1333         0         0    0.0667         0
    0.0667    0.0667    0.0667         0    0.0667    0.1333    0.1333         0         0    0.0667
    0.0667         0    0.0667    0.0667         0    0.0667    0.1333         0    0.0667    0.0667
    0.2000    0.0667         0         0    0.0667         0         0    0.1333         0    0.0667
    0.2000         0         0    0.1333         0    0.1333         0    0.0667         0         0
         0    0.0667    0.0667    0.0667    0.1333         0    0.2000         0         0         0
    0.0667    0.0667         0    0.0667    0.1333         0         0    0.0667    0.1333    0.0667
```

The p-value is slightly greater than 0.10, which indicates to retain the null hypothesis that the simpler model is at most as accurate as the more complex model. This result is consistent for any significance level (`Alpha`) that is at most 0.10.

`e1` and `e2` are 10-by-10 matrices containing misclassification costs. Row r corresponds to run r of the repeated cross validation. Column k corresponds to test-set fold k within a particular cross-validation run. For example, element (2,4) of `e2` is 0.1333. This value means that in cross-validation run 2, when the test set is fold 4, the estimated test-set misclassification cost is 0.1333.
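The row and column convention described above can be illustrated with a small sketch. The 3-by-2 loss matrix below is made up for illustration (it is not `testckfold` output), and the variable names are hypothetical.

```matlab
% Hypothetical losses for 3 runs (rows) and 2 folds (columns).
% The values are invented for illustration only.
e = [0.10 0.20;
     0.00 0.30;
     0.20 0.10];

lossRun2Fold1   = e(2,1);     % loss in run 2 when fold 1 is the test set
meanLossPerRun  = mean(e,2);  % average over the folds within each run
overallMeanLoss = mean(e(:)); % average over all runs and folds
```

Averaging over the second dimension summarizes each run, whereas averaging over all elements gives a single summary loss for the model.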

Reduce classification model complexity by selecting a subset of predictor variables (features) from a larger set. Then, statistically compare the accuracy between the two models.

Load the `ionosphere` data set.

`load ionosphere;`

Train an ensemble of 100 boosted classification trees using AdaBoostM1 and the entire set of predictors. Inspect the importance measure for each predictor.

```
t = templateTree('MaxNumSplits',1); % Weak-learner template tree object
C = fitcensemble(X,Y,'Method','AdaBoostM1','Learners',t);
predImp = predictorImportance(C);

figure;
bar(predImp);
h = gca;
h.XTick = 1:2:h.XLim(2);
title('Predictor Importances');
xlabel('Predictor');
ylabel('Importance measure');
```

Identify the top five predictors in terms of their importance.

```
[~,idxSort] = sort(predImp,'descend');
idx5 = idxSort(1:5);
```

Test whether the two models have equal predictive accuracies. Specify the reduced data set and then the full predictor data. Use parallel computing to speed up computations.

```
s = RandStream('mlfg6331_64');
Options = statset('UseParallel',true,'Streams',s,'UseSubstreams',true);

[h,p,e1,e2] = testckfold(C,C,X(:,idx5),X,Y,'Options',Options)
```
```
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).
```
```
h = logical
   0
```
```
p = 0.4161
```
```
e1 = 5×2

    0.0686    0.0795
    0.0800    0.0625
    0.0914    0.0568
    0.0400    0.0739
    0.0914    0.0966
```
```
e2 = 5×2

    0.0914    0.0625
    0.1257    0.0682
    0.0971    0.0625
    0.0800    0.0909
    0.0914    0.1193
```

`testckfold` treats trained classification models as templates, and so it ignores all fitted parameters in `C`. That is, `testckfold` cross validates `C` using only the specified options and the predictor data to estimate the out-of-fold classification losses.

`h = 0` indicates to not reject the null hypothesis that the two models have equal predictive accuracies. This result favors the simpler ensemble.

## Input Arguments


First classification model template or trained classification model (`C1`), specified as any classification model template object or trained classification model object described in these tables.

| Template Type | Returned By |
| --- | --- |
| Classification tree | `templateTree` |
| Discriminant analysis | `templateDiscriminant` |
| Ensemble (boosting, bagging, and random subspace) | `templateEnsemble` |
| Error-correcting output codes (ECOC), multiclass classification model | `templateECOC` |
| kNN | `templateKNN` |
| Naive Bayes | `templateNaiveBayes` |
| Support Vector Machine (SVM) | `templateSVM` |

For efficiency, supply a classification model template object instead of a trained classification model object.

Second classification model template or trained classification model (`C2`), specified as any classification model template object or trained classification model object described in these tables.

| Template Type | Returned By |
| --- | --- |
| Classification tree | `templateTree` |
| Discriminant analysis | `templateDiscriminant` |
| Ensemble (boosting, bagging, and random subspace) | `templateEnsemble` |
| Error-correcting output codes (ECOC), multiclass classification model | `templateECOC` |
| kNN | `templateKNN` |
| Naive Bayes | `templateNaiveBayes` |
| Support Vector Machine (SVM) | `templateSVM` |

For efficiency, supply a classification model template object instead of a trained classification model object.

Data used to apply to the first full classification model or template, `C1`, specified as a numeric matrix or table.

Each row of `X1` corresponds to one observation, and each column corresponds to one variable. `testckfold` does not support multi-column variables and cell arrays other than cell arrays of character vectors.

`X1` and `X2` must be of the same data type, and `X1`, `X2`, `Y` must have the same number of observations.

If you specify `Y` as an array, then `testckfold` treats all columns of `X1` as separate predictor variables.

Data Types: `double` | `single` | `table`

Data used to apply to the second full classification model or template, `C2`, specified as a numeric matrix or table.

Each row of `X2` corresponds to one observation, and each column corresponds to one variable. `testckfold` does not support multi-column variables and cell arrays other than cell arrays of character vectors.

`X1` and `X2` must be of the same data type, and `X1`, `X2`, `Y` must have the same number of observations.

If you specify `Y` as an array, then `testckfold` treats all columns of `X2` as separate predictor variables.

Data Types: `double` | `single` | `table`

True class labels, specified as a categorical, character, or string array, a logical or numeric vector, a cell array of character vectors, or a character vector or string scalar.

• For a character vector or string scalar, `X1` and `X2` must be tables, their response variables must have the same name and values, and `Y` must be the common variable name. For example, if `X1.Labels` and `X2.Labels` are the response variables, then `Y` is `'Labels'` and `X1.Labels` and `X2.Labels` must be equivalent.

• For all other supported data types, `Y` is an array of true class labels.

• If `Y` is a character array, then each element must correspond to one row of the array.

• `X1`, `X2`, `Y` must have the same number of observations (rows).

• If both of these statements are true, then you can omit supplying `Y`.

• `X1` and `X2` are tables containing the same response variable (values and name).

• `C1` and `C2` are full classification models containing `ResponseName` properties specifying the response variable names in `X1` and `X2`.

Consequently, `testckfold` uses the common response variable in the tables. For example, if the response variables in the tables are `X1.Labels` and `X2.Labels`, and the values of `C1.ResponseName` and `C2.ResponseName` are `'Labels'`, then you do not have to supply `Y`.

Data Types: `categorical` | `char` | `string` | `logical` | `single` | `double` | `cell`
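As a rough sketch of the variable-name form of `Y`, the following compares one tree template on a full and a reduced predictor table. The table and variable names (`tbl`, `tblSmall`, `SL`, `SW`, `PL`, `PW`, `Species`) are hypothetical choices for this illustration; the data set is the `fisheriris` sample used elsewhere on this page.

```matlab
load fisheriris

% Build a table holding the predictors and the response (names are hypothetical)
tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
tbl.Species = species;

% Second table: a subset of the predictors plus the same response variable
tblSmall = tbl(:,{'PL','PW','Species'});

rng(1) % For reproducibility
% Y is the name of the response variable common to both tables
h = testckfold(templateTree,templateTree,tbl,tblSmall,'Species');
```

Because both tables carry identical `Species` columns, the requirement that the response variables share the same name and values is satisfied.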

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'Alternative','greater','Test','10x10t','Options',statset('UseParallel',true)` specifies to test whether the first set of predicted class labels is more accurate than the second set, to conduct the 10-by-10 t test, and to use parallel computing for cross-validation.

Hypothesis test significance level, specified as the comma-separated pair consisting of `'Alpha'` and a scalar value in the interval (0,1).

Example: `'Alpha',0.1`

Data Types: `single` | `double`

Alternative hypothesis to assess, specified as the comma-separated pair consisting of `'Alternative'` and one of the values listed in the table.

| Value | Alternative Hypothesis Description | Supported Tests |
| --- | --- | --- |
| `'unequal'` (default) | For predicting `Y`, the set of predictions resulting from `C1` applied to `X1` and `C2` applied to `X2` have unequal accuracies. | `'5x2F'`, `'5x2t'`, and `'10x10t'` |
| `'greater'` | For predicting `Y`, the set of predictions resulting from `C1` applied to `X1` is more accurate than `C2` applied to `X2`. | `'5x2t'` and `'10x10t'` |
| `'less'` | For predicting `Y`, the set of predictions resulting from `C1` applied to `X1` is less accurate than `C2` applied to `X2`. | `'5x2t'` and `'10x10t'` |

For details on supported tests, see `Test`.

Example: `'Alternative','greater'`
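A minimal sketch of a one-sided test follows. Because the default `'5x2F'` test supports only the `'unequal'` alternative, the sketch switches to `'5x2t'`. The choice of learners (a tree and a kNN template) is arbitrary and only for illustration.

```matlab
load fisheriris
rng(1) % For reproducibility

% Alternative 'less': the first model is less accurate than the second.
% One-sided alternatives require a t test ('5x2t' or '10x10t').
[h,p] = testckfold(templateTree,templateKNN,meas,meas,species, ...
    'Test','5x2t','Alternative','less');
```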

Flag identifying categorical predictors in the first test-set predictor data (`X1`), specified as the comma-separated pair consisting of `'X1CategoricalPredictors'` and one of the following:

• A numeric vector with indices from `1` through `p`, where `p` is the number of columns of `X1`.

• A logical vector of length `p`, where a `true` entry means that the corresponding column of `X1` is a categorical variable.

• `'all'`, meaning all predictors are categorical.

Specification of `X1CategoricalPredictors` is appropriate if:

• At least one predictor is categorical and `C1` is a classification tree, an ensemble of classification trees, an ECOC model, or a naive Bayes classification model.

• All predictors are categorical and `C1` is a kNN classification model.

If you specify `X1CategoricalPredictors` for any other case, then `testckfold` throws an error. For example, the function cannot train SVM learners using categorical predictors.

The default is `[]`, which indicates that there are no categorical predictors.

Example: `'X1CategoricalPredictors','all'`

Data Types: `single` | `double` | `logical` | `char` | `string`

Flag identifying categorical predictors in the second test-set predictor data (`X2`), specified as the comma-separated pair consisting of `'X2CategoricalPredictors'` and one of the following:

• A numeric vector with indices from `1` through `p`, where `p` is the number of columns of `X2`.

• A logical vector of length `p`, where a `true` entry means that the corresponding column of `X2` is a categorical variable.

• `'all'`, meaning all predictors are categorical.

Specification of `X2CategoricalPredictors` is appropriate if:

• At least one predictor is categorical and `C2` is a classification tree, an ensemble of classification trees, an ECOC model, or a naive Bayes classification model.

• All predictors are categorical and `C2` is a kNN classification model.

If you specify `X2CategoricalPredictors` for any other case, then `testckfold` throws an error. For example, the function cannot train SVM learners using categorical predictors.

The default is `[]`, which indicates that there are no categorical predictors.

Example: `'X2CategoricalPredictors','all'`

Data Types: `single` | `double` | `logical` | `char` | `string`

Class names, specified as the comma-separated pair consisting of `'ClassNames'` and a categorical, character, or string array, logical or numeric vector, or cell array of character vectors. You must set `ClassNames` using the data type of `Y`.

If `ClassNames` is a character array, then each element must correspond to one row of the array.

Use `ClassNames` to:

• Specify the order of any input argument dimension that corresponds to class order. For example, use `ClassNames` to specify the order of the dimensions of `Cost`.

• Select a subset of classes for testing. For example, suppose that the set of all distinct class names in `Y` is `{'a','b','c'}`. To train and test models using observations from classes `'a'` and `'c'` only, specify `'ClassNames',{'a','c'}`.

The default is the set of all distinct class names in `Y`.

Example: `'ClassNames',{'b','g'}`

Data Types: `single` | `double` | `logical` | `char` | `string` | `cell` | `categorical`

Classification cost, specified as the comma-separated pair consisting of `'Cost'` and a square matrix or structure array.

• If you specify the square matrix `Cost`, then `Cost(i,j)` is the cost of classifying a point into class `j` if its true class is `i`. That is, the rows correspond to the true class and the columns correspond to the predicted class. To specify the class order for the corresponding rows and columns of `Cost`, additionally specify the `ClassNames` name-value pair argument.

• If you specify the structure `S`, then `S` must have two fields:

• `S.ClassNames`, which contains the class names as a variable of the same data type as `Y`. You can use this field to specify the order of the classes.

• `S.ClassificationCosts`, which contains the cost matrix, with rows and columns ordered as in `S.ClassNames`

For cost-sensitive testing, use `testcholdout`.

It is a best practice to supply the same cost matrix used to train the classification models.

The default is `Cost(i,j) = 1` if `i ~= j`, and `Cost(i,j) = 0` if `i = j`.

Example: `'Cost',[0 1 2 ; 1 0 2; 2 2 0]`

Data Types: `double` | `single` | `struct`
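The two ways of specifying costs can be sketched as follows. The number of classes and the cost values are assumptions for illustration; the structure field names are the ones described above.

```matlab
K = 3;                           % number of classes (assumed)
defaultCost = ones(K) - eye(K);  % 0 on the diagonal, 1 elsewhere

% Structure form: class order and cost matrix travel together
S.ClassNames = {'setosa','versicolor','virginica'};
S.ClassificationCosts = [0 2 2; 2 0 1; 2 1 0];

% Rows index the true class, columns the predicted class
costSetosaAsVirginica = S.ClassificationCosts(1,3);
```

With the structure form, the class order is explicit, so a separate `ClassNames` argument is not needed to interpret the rows and columns.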

Loss function, specified as the comma-separated pair consisting of `'LossFun'` and `'classiferror'`, `'binodeviance'`, `'exponential'`, `'hinge'`, or a function handle.

• The following table lists the available loss functions.

| Value | Loss Function |
| --- | --- |
| `'binodeviance'` | Binomial deviance |
| `'classiferror'` | Classification error |
| `'exponential'` | Exponential loss |
| `'hinge'` | Hinge loss |

• Specify your own function using function handle notation.

Suppose that `n = size(X,1)` is the sample size and there are `K` unique classes. Your function must have the signature `lossvalue = lossfun(C,S,W,Cost)`, where:

• The output argument `lossvalue` is a scalar.

• `lossfun` is the name of your function.

• `C` is an `n`-by-`K` logical matrix with rows indicating which class the corresponding observation belongs to. The column order corresponds to the class order in the `ClassNames` name-value pair argument.

Construct `C` by setting `C(p,q) = 1` if observation `p` is in class `q`, for each row. Set all other elements of row `p` to `0`.

• `S` is an `n`-by-`K` numeric matrix of classification scores. The column order corresponds to the class order in the `ClassNames` name-value pair argument.

• `W` is an `n`-by-1 numeric vector of observation weights. If you pass `W`, the software normalizes the weights to sum to `1`.

• `Cost` is a `K`-by-`K` numeric matrix of classification costs. For example, `Cost = ones(K) - eye(K)` specifies a cost of `0` for correct classification and a cost of `1` for misclassification.

Specify your function using `'LossFun',@lossfun`.
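To make the signature concrete, here is a sketch of a hypothetical custom loss: a weighted misclassification rate in which an observation counts as correct when its true class attains the maximum score. The function and the synthetic inputs are illustrative assumptions, not part of the toolbox. The sketch ignores `Cost` and relies on implicit expansion (R2016b or later).

```matlab
% Hypothetical custom loss with the required signature lossfun(C,S,W,Cost).
% C: n-by-K logical class indicators, S: n-by-K scores,
% W: n-by-1 weights, Cost: K-by-K costs (unused in this sketch).
lossfun = @(C,S,W,Cost) sum(W .* ~any(C & (S == max(S,[],2)),2)) / sum(W);

% Synthetic check with n = 3 observations and K = 2 classes
C = logical([1 0; 0 1; 1 0]);      % true classes: 1, 2, 1
S = [0.9 0.1; 0.8 0.2; 0.3 0.7];   % predicted classes: 1, 1, 2
W = ones(3,1)/3;
Cost = ones(2) - eye(2);
lossvalue = lossfun(C,S,W,Cost);   % two of three observations are misclassified
```

Because `lossfun` is already a function handle here, it would be passed as `'LossFun',lossfun`.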

Parallel computing options, specified as the comma-separated pair consisting of `'Options'` and a structure array returned by `statset`. These options require Parallel Computing Toolbox. `testckfold` uses the `'Streams'`, `'UseParallel'`, and `'UseSubstreams'` fields.

This table summarizes the available options.

| Option | Description |
| --- | --- |
| `'Streams'` | A `RandStream` object or cell array of such objects. If you do not specify `Streams`, the software uses the default stream or streams. If you specify `Streams`, use a single object except when the following are true: you have an open parallel pool, `UseParallel` is `true`, and `UseSubstreams` is `false`. In that case, use a cell array of the same size as the parallel pool. If a parallel pool is not open, then the software tries to open one (depending on your preferences), and `Streams` must supply a single random number stream. |
| `'UseParallel'` | If you have Parallel Computing Toolbox, then you can invoke a pool of workers by setting `'UseParallel',true`. |
| `'UseSubstreams'` | Set to `true` to compute in parallel using the stream specified by `'Streams'`. Default is `false`. For example, set `Streams` to a type allowing substreams, such as `'mlfg6331_64'` or `'mrg32k3a'`. |

Example: `'Options',statset('UseParallel',true)`

Data Types: `struct`

Prior probabilities for each class, specified as the comma-separated pair consisting of `'Prior'` and `'empirical'`, `'uniform'`, a numeric vector, or a structure.

This table summarizes the available options for setting prior probabilities.

| Value | Description |
| --- | --- |
| `'empirical'` | The class prior probabilities are the class relative frequencies in `Y`. |
| `'uniform'` | All class prior probabilities are equal to 1/K, where K is the number of classes. |
| numeric vector | Each element is a class prior probability. Specify the order using the `ClassNames` name-value pair argument. The software normalizes the elements such that they sum to `1`. |
| structure | A structure `S` with two fields: `S.ClassNames`, which contains the class names as a variable of the same type as `Y`, and `S.ClassProbs`, which contains a vector of corresponding prior probabilities. The software normalizes the elements such that they sum to `1`. |

Example: `'Prior',struct('ClassNames',{{'setosa','versicolor'}},'ClassProbs',[1,2])`

Data Types: `char` | `string` | `single` | `double` | `struct`
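The options in the table can be mimicked with a few lines of arithmetic. This sketch uses made-up labels and mirrors the descriptions above. It is not the toolbox's internal code.

```matlab
Y = {'a';'b';'b';'c'};                  % hypothetical class labels
[classNames,~,idx] = unique(Y);         % distinct classes and membership index
numClasses = numel(classNames);

empiricalPrior = accumarray(idx,1)' / numel(Y);    % class relative frequencies
uniformPrior   = ones(1,numClasses) / numClasses;  % 1/K for every class

rawPrior = [1 2];                        % unnormalized prior, as in the example above
normalizedPrior = rawPrior / sum(rawPrior);  % rescaled to sum to 1
```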

Test to conduct, specified as the comma-separated pair consisting of `'Test'` and one of the following: `'5x2F'`, `'5x2t'`, `'10x10t'`.

| Value | Description | Supported Alternative Hypothesis |
| --- | --- | --- |
| `'5x2F'` (default) | 5-by-2 paired F test. Appropriate for two-sided testing only. | `'unequal'` |
| `'5x2t'` | 5-by-2 paired t test | `'unequal'`, `'less'`, `'greater'` |
| `'10x10t'` | 10-by-10 repeated cross-validation t test | `'unequal'`, `'less'`, `'greater'` |

For details on the available tests, see Repeated Cross-Validation Tests. For details on supported alternative hypotheses, see `Alternative`.

Example: `'Test','10x10t'`

Verbosity level, specified as the comma-separated pair consisting of `'Verbose'` and `0`, `1`, or `2`. `Verbose` controls the amount of diagnostic information that the software displays in the Command Window during training of each cross-validation fold.

This table summarizes the available verbosity level options.

| Value | Description |
| --- | --- |
| `0` | The software does not display diagnostic information. |
| `1` | The software displays diagnostic messages every time it implements a new cross-validation run. |
| `2` | The software displays diagnostic messages every time it implements a new cross-validation run, and every time it trains on a particular fold. |

Example: `'Verbose',1`

Data Types: `double` | `single`

Observation weights, specified as the comma-separated pair consisting of `'Weights'` and a numeric vector.

The size of `Weights` must equal the number of rows of `X1`. The software weighs the observations in each row of `X1` and `X2` with the corresponding weight in `Weights`.

The software normalizes `Weights` to sum up to the value of the prior probability in the respective class.

Data Types: `double` | `single`
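One plausible reading of this normalization is that, within each class, the weights are rescaled so that they sum to that class's prior probability. The sketch below illustrates that reading with made-up numbers. It is an assumption about the described behavior, not the toolbox's internal code.

```matlab
classIdx = [1;1;2;2;2];   % hypothetical class index of each observation
w        = [1;3;1;1;2];   % hypothetical raw observation weights
prior    = [0.5 0.5];     % hypothetical class prior probabilities

wNorm = zeros(size(w));
for k = 1:numel(prior)
    inClass = (classIdx == k);
    % Rescale so that the weights in class k sum to prior(k)
    wNorm(inClass) = w(inClass) / sum(w(inClass)) * prior(k);
end
```

Under this reading, the normalized weights sum to 1 overall whenever the priors do.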

### Notes:

• `testckfold` treats trained classification models as templates. Therefore, it ignores all fitted parameters in the model. That is, `testckfold` cross-validates using only the options specified in the model and the predictor data.

• The repeated cross-validation tests depend on the assumption that the test statistics are asymptotically normal under the null hypothesis. Highly imbalanced cost matrices (for example, `Cost = [0 100;1 0]`) and highly discrete response distributions (that is, most of the observations are in a small number of classes) might violate the asymptotic normality assumption. For cost-sensitive testing, use `testcholdout`.

• `NaN`s, `<undefined>` values, empty character vectors (`''`), empty strings (`""`), and `<missing>` values indicate missing data values.

## Output Arguments


Hypothesis test result, returned as a logical value.

`h = 1` indicates the rejection of the null hypothesis at the `Alpha` significance level.

`h = 0` indicates failure to reject the null hypothesis at the `Alpha` significance level.

Data Types: `logical`

p-value of the test, returned as a scalar in the interval [0,1]. `p` is the probability that a random test statistic is at least as extreme as the observed test statistic, given that the null hypothesis is true.

`testckfold` estimates `p` using the distribution of the test statistic, which varies with the type of test. For details on test statistics, see Repeated Cross-Validation Tests.

Classification losses, returned as a numeric matrix. The rows of `e1` correspond to the cross-validation run and the columns correspond to the test fold.

`testckfold` applies the first test-set predictor data (`X1`) to the first classification model (`C1`) to estimate the first set of class labels.

`e1` summarizes the accuracy of the first set of class labels predicting the true class labels (`Y`) for each cross-validation run and fold. The meaning of the elements of `e1` depends on the type of classification loss.

Classification losses, returned as a numeric matrix. The rows of `e2` correspond to the cross-validation run and the columns correspond to the test fold.

`testckfold` applies the second test-set predictor data (`X2`) to the second classification model (`C2`) to estimate the second set of class labels.

`e2` summarizes the accuracy of the second set of class labels predicting the true class labels (`Y`) for each cross-validation run and fold. The meaning of the elements of `e2` depends on the type of classification loss.


### Repeated Cross-Validation Tests

Repeated cross-validation tests form the test statistic for comparing the accuracies of two classification models by combining the classification loss differences resulting from repeatedly cross-validating the data. Repeated cross-validation tests are useful when sample size is limited.

To conduct an R-by-K test:

1. Randomly divide (stratified by class) the predictor data sets and true class labels into K sets, R times. Each division is called a run and each set within a run is called a fold. Each run contains the complete, but divided, data sets.

2. For runs r = 1 through R, repeat these steps for k = 1 through K:

1. Reserve fold k as a test set, and train the two classification models using their respective predictor data sets on the remaining K – 1 folds.

2. Predict class labels using the trained models and their respective fold k predictor data sets.

3. Estimate the classification loss by comparing the two sets of estimated labels to the true labels. Denote ${e}_{crk}$ as the classification loss when the test set is fold k in run r of classification model c.

4. Compute the difference between the classification losses of the two models:

`$\hat{\delta}_{rk} = e_{1rk} - e_{2rk}.$`

At the end of a run, there are K classification losses per classification model.

3. Combine the results of step 2. For each r = 1 through R:

• Estimate the within-fold averages of the differences: $\overline{\delta}_{r} = \frac{1}{K}\sum_{k=1}^{K}\hat{\delta}_{rk}.$

• Estimate the overall average of the differences: $\overline{\delta} = \frac{1}{KR}\sum_{r=1}^{R}\sum_{k=1}^{K}\hat{\delta}_{rk}.$

• Estimate the within-fold variances of the differences: $s_{r}^{2} = \frac{1}{K}\sum_{k=1}^{K}\left(\hat{\delta}_{rk} - \overline{\delta}_{r}\right)^{2}.$

• Estimate the average of the within-fold variances: $\overline{s}^{2} = \frac{1}{R}\sum_{r=1}^{R}s_{r}^{2}.$

• Estimate the overall sample variance of the differences: $S^{2} = \frac{1}{KR-1}\sum_{r=1}^{R}\sum_{k=1}^{K}\left(\hat{\delta}_{rk} - \overline{\delta}\right)^{2}.$

4. Compute the test statistic. All supported tests described here assume that, under H0, the estimated differences are independent and approximately normally distributed, with mean 0 and a finite, common standard deviation. However, these tests violate the independence assumption, and so the test-statistic distributions are approximate.

• For K = 2, the test is a paired test. The two supported tests are a paired t test and a paired F test.

• The test statistic for the paired t test is

`$t_{paired}^{\ast} = \frac{\hat{\delta}_{11}}{\sqrt{\overline{s}^{2}}}.$`

$t_{paired}^{\ast}$ has a t-distribution with R degrees of freedom under the null hypothesis.

To reduce the effects of correlation between the estimated differences, the quantity $\hat{\delta}_{11}$ occupies the numerator rather than $\overline{\delta}$.

5-by-2 paired t tests can be slightly conservative [4].

• The test statistic for the paired F test is

`$F_{paired}^{\ast} = \frac{\frac{1}{RK}\sum_{r=1}^{R}\sum_{k=1}^{K}\left(\hat{\delta}_{rk}\right)^{2}}{\overline{s}^{2}}.$`

$F_{paired}^{\ast}$ has an F distribution with RK and R degrees of freedom.

A 5-by-2 paired F test has comparable power to the 5-by-2 t test, but is more conservative [1].

• For K > 2, the test is a repeated cross-validation test. The test statistic is

`${t}_{CV}^{\ast }=\frac{\overline{\delta }}{S/\sqrt{\nu +1}}.$`

${t}_{CV}^{\ast }$ has a t distribution with ν degrees of freedom. If the differences were truly independent, then ν = RK – 1. Because the differences are not independent, the degrees of freedom parameter must be optimized.

For a 10-by-10 repeated cross-validation t test, the optimal degrees of freedom is between 8 and 11 ([2] and [3]). `testckfold` uses ν = 10.

The advantage of repeated cross-validation tests over paired tests is that the results are more repeatable [3]. The disadvantage is that they require high computational resources.
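The three test statistics above can be evaluated directly from a matrix of loss differences. The following sketch uses made-up differences and assumes the Statistics and Machine Learning Toolbox functions `tcdf` and `fcdf` are available. The p-value lines are one reading of the stated null distributions (two-sided for the t tests, upper tail for the F test) and are not taken from the `testckfold` source.

```matlab
% Hypothetical loss differences for a 5-by-2 test (R = 5 runs, K = 2 folds).
delta = [0.02 -0.01; 0.03 0.01; 0.00 0.02; 0.01 0.01; -0.02 0.04];
[R,K] = size(delta);
s2Bar = mean(var(delta,1,2));      % average of the per-run variances (1/K normalization)

% 5-by-2 paired t statistic; assumed two-sided p-value with R degrees of freedom
tPaired = delta(1,1)/sqrt(s2Bar);
pT = 2*tcdf(-abs(tPaired),R);

% 5-by-2 paired F statistic; assumed upper-tail p-value with RK and R degrees of freedom
FPaired = mean(delta(:).^2)/s2Bar;
pF = fcdf(FPaired,R*K,R,'upper');

% Repeated cross-validation t statistic for hypothetical 10-by-10 differences
delta10 = 0.01 + 0.02*reshape(sin(1:100),10,10);
nu  = 10;                          % degrees of freedom stated for the 10-by-10 test
S   = std(delta10(:));             % overall sample standard deviation
tCV = mean(delta10(:))/(S/sqrt(nu + 1));
pCV = 2*tcdf(-abs(tCV),nu);
```

Only the first difference `delta(1,1)` enters the numerator of the paired t statistic, while the F statistic uses all RK squared differences.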

### Classification Loss

Classification losses indicate the accuracy of a classification model or set of predicted labels. In general, for a fixed cost matrix, classification accuracy decreases as classification loss increases.

`testckfold` returns the classification losses (see `e1` and `e2`) under the alternative hypothesis (that is, the unrestricted classification losses). In the definitions that follow:

• The classification losses focus on the first classification model. The classification losses for the second model are similar.

• ntest is the test-set sample size.

• I(x) is the indicator function. If x is a true statement, then I(x) = 1. Otherwise, I(x) = 0.

• ${\stackrel{^}{p}}_{1j}$ is the predicted class assignment of classification model 1 for observation j.

• yj is the true class label of observation j.

• Binomial deviance has the form

`${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{j}f\left({X}_{j}\right)\right)\right)}{\sum _{j=1}^{{n}_{test}}{w}_{j}}$`

where:

• yj = 1 for the positive class and -1 for the negative class.

• $f\left({X}_{j}\right)$ is the classification score.

• ${w}_{j}$ is the weight of observation j.

The binomial deviance has connections to the maximization of the binomial likelihood function. For details on binomial deviance, see [5].

• Exponential loss is similar to binomial deviance and has the form

`${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}\mathrm{exp}\left(-{y}_{j}f\left({X}_{j}\right)\right)}{\sum _{j=1}^{{n}_{test}}{w}_{j}}.$`

yj and $f\left({X}_{j}\right)$ take the same forms here as in the binomial deviance formula.

• Hinge loss has the form

`${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}\mathrm{max}\left\{0,1-{y}_{j}f\left({X}_{j}\right)\right\}}{\sum _{j=1}^{{n}_{test}}{w}_{j}}.$`

yj and $f\left({X}_{j}\right)$ take the same forms here as in the binomial deviance formula.

Hinge loss penalizes misclassified observations linearly and is related to the SVM objective function used for optimization. For more details on hinge loss, see [5].

• Misclassification rate, or classification error, is a scalar in the interval [0,1] representing the proportion of misclassified observations. That is, the misclassification rate for the first classification model is

`${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}I\left({\stackrel{^}{p}}_{1j}\ne {y}_{j}\right)}{\sum _{j=1}^{{n}_{test}}{w}_{j}}.$`
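To make the four loss definitions concrete, the sketch below evaluates each one on a small, made-up binary test set. The labels, scores, and equal weights are hypothetical, and the rule that the sign of the score determines the predicted class is an assumption for this illustration only.

```matlab
% Hypothetical test-set data for a binary problem (all values are made up).
y = [1; -1; 1; 1; -1];               % true labels coded as +1 and -1
f = [0.8; -0.3; -0.2; 1.5; 0.4];     % classification scores f(X_j) of model 1
w = ones(size(y));                   % observation weights (equal here)

yhat = sign(f);                      % assumed rule: positive score -> positive class

binoDeviance = sum(w.*log(1 + exp(-2*y.*f)))/sum(w);   % binomial deviance
expLoss      = sum(w.*exp(-y.*f))/sum(w);              % exponential loss
hingeLoss    = sum(w.*max(0,1 - y.*f))/sum(w);         % hinge loss
classifErr   = sum(w.*(yhat ~= y))/sum(w);             % misclassification rate
```

Observations 3 and 5 have scores whose sign disagrees with the label, so they contribute to the misclassification rate and receive the largest hinge and exponential penalties.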

## Tips

• Examples of ways to compare models include:

• Compare the accuracies of a simple classification model and a more complex model by passing the same set of predictor data.

• Compare the accuracies of two different models using two different sets of predictors.

• Perform various types of feature selection. For example, you can compare the accuracy of a model trained using a set of predictors to the accuracy of one trained on a subset or different set of predictors. You can arbitrarily choose the set of predictors, or use a feature selection technique like PCA or sequential feature selection (see `pca` and `sequentialfs`).

• If both of these statements are true, then you can omit supplying `Y`.

• `X1` and `X2` are tables containing the response variable and use the same response variable name.

• `C1` and `C2` are full classification models containing equal `ResponseName` properties (e.g. `strcmp(C1.ResponseName,C2.ResponseName)` = `1`).

Consequently, `testckfold` uses the common response variable in the tables.

• One way to perform cost-insensitive feature selection is:

1. Create a classification model template that characterizes the first classification model (`C1`).

2. Create a classification model template that characterizes the second classification model (`C2`).

3. Specify two predictor data sets. For example, specify `X1` as the full predictor set and `X2` as a reduced set.

4. Enter `testckfold(C1,C2,X1,X2,Y,'Alternative','less')`. If `testckfold` returns `1`, then there is enough evidence to suggest that the classification model that uses fewer predictors performs better than the model that uses the full predictor set.

Alternatively, you can assess whether there is a significant difference between the accuracies of the two models. To perform this assessment, remove the `'Alternative','less'` specification in step 4. `testckfold` conducts a two-sided test, and `h = 0` indicates that there is not enough evidence to suggest a difference in the accuracy of the two models.

• The tests are appropriate for the misclassification rate classification loss, but you can specify other loss functions (see `LossFun`). The key assumptions are that the estimated differences in classification loss are independent and normally distributed with mean 0 and finite common variance under the two-sided null hypothesis. Classification losses other than the misclassification rate can violate these assumptions.

• Highly discrete data, imbalanced classes, and highly imbalanced cost matrices can violate the normality assumption of classification loss differences.
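The cost-insensitive feature selection workflow described in the tips above might look like the following sketch. It assumes the `fisheriris` sample data set that ships with MATLAB, uses classification tree templates for both models, and picks the reduced predictor set (the two petal measurements) arbitrarily. The call specifies `'Test','5x2t'` explicitly, on the assumption that a one-sided alternative calls for a t test rather than the default F test.

```matlab
load fisheriris                 % sample data: meas (150-by-4), species (150-by-1 cell)
rng(1)                          % for reproducible cross-validation partitions

C1 = templateTree;              % template for the first classification model
C2 = templateTree;              % same learner; only the predictor sets differ
X1 = meas;                      % full predictor set
X2 = meas(:,3:4);               % reduced set (petal measurements), an arbitrary choice

% H1: the model using X1 is less accurate than the model using X2.
[h,p] = testckfold(C1,C2,X1,X2,species,'Alternative','less','Test','5x2t');
```

Here `h = 1` would suggest that the reduced-predictor model performs better, and `h = 0` would mean there is not enough evidence for that conclusion at the default significance level.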

## Algorithms

If you conduct the 10-by-10 repeated cross-validation t test by specifying `'Test','10x10t'`, then `testckfold` uses 10 degrees of freedom for the t distribution to find the critical region and estimate the p-value. For more details, see [2] and [3].
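As a brief usage sketch, the call below requests the 10-by-10 repeated cross-validation t test when comparing two learners on the same predictor data. The choice of the `fisheriris` sample data and of tree and discriminant analysis templates is an assumption made for illustration only.

```matlab
load fisheriris                 % sample data: meas (predictors), species (labels)
rng(1)                          % for reproducible cross-validation partitions

C1 = templateTree;              % first model: classification tree
C2 = templateDiscriminant;      % second model: discriminant analysis

% Two-sided 10-by-10 repeated cross-validation t test on identical predictors
[h,p,e1,e2] = testckfold(C1,C2,meas,meas,species,'Test','10x10t');
```

Because this test trains each model on every fold of every run, it takes noticeably longer than the 5-by-2 tests.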

## Alternatives

Use `testcholdout`:

• For test sets with larger sample sizes

• To implement variants of the McNemar test to compare two classification model accuracies

• For cost-sensitive testing using a chi-square or likelihood ratio test. The chi-square test uses `quadprog`, which requires an Optimization Toolbox™ license.

## References

[1] Alpaydin, E. “Combined 5 x 2 CV F Test for Comparing Supervised Classification Learning Algorithms.” Neural Computation, Vol. 11, No. 8, 1999, pp. 1885–1892.

[2] Bouckaert, R. “Choosing Between Two Learning Algorithms Based on Calibrated Tests.” International Conference on Machine Learning, 2003, pp. 51–58.

[3] Bouckaert, R., and E. Frank. “Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms.” Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, 2004, pp. 3–12.

[4] Dietterich, T. “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.” Neural Computation, Vol. 10, No. 7, 1998, pp. 1895–1923.

[5] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd Ed. New York: Springer, 2008.