# testckfold

Compare accuracies of two classification models by repeated cross-validation

## Description

testckfold statistically assesses the accuracies of two classification models by repeatedly cross-validating the two models, determining the differences in the classification loss, and then formulating the test statistic by combining the classification loss differences. This type of test is particularly appropriate when sample size is limited.

You can assess whether the accuracies of the classification models are different, or whether one classification model performs better than another. Available tests include a 5-by-2 paired t test, a 5-by-2 paired F test, and a 10-by-10 repeated cross-validation t test. For more details, see Repeated Cross-Validation Tests. To speed up computations, testckfold supports parallel computing (requires a Parallel Computing Toolbox™ license).

h = testckfold(C1,C2,X1,X2) returns the test decision that results from conducting a 5-by-2 paired F cross-validation test. The null hypothesis is that the classification models C1 and C2 have equal accuracy in predicting the true class labels using the predictor and response data in the tables X1 and X2. h = 1 indicates to reject the null hypothesis at the 5% significance level.

testckfold conducts the cross-validation test by applying C1 and C2 to all predictor variables in X1 and X2, respectively. The true class labels in X1 and X2 must be the same. The response variable names in X1, X2, C1.ResponseName, and C2.ResponseName must be the same.

For examples of ways to compare models, see Tips.

h = testckfold(C1,C2,X1,X2,Y) applies the full classification model or classification templates C1 and C2 to all predictor variables in the tables or matrices of data X1 and X2, respectively. Y is the table variable name corresponding to the true class labels, or an array of true class labels.

h = testckfold(___,Name,Value) uses any of the input arguments in the previous syntaxes and additional options specified by one or more Name,Value pair arguments. For example, you can specify the type of alternative hypothesis, the type of test, or the use of parallel computing.

[h,p,e1,e2] = testckfold(___) also returns the p-value for the hypothesis test (p) and the respective classification losses for each cross-validation run and fold (e1 and e2).

## Examples

At each node, fitctree chooses the best predictor to split using an exhaustive search by default. Alternatively, you can choose to split the predictor that shows the most evidence of dependence with the response by conducting curvature tests. This example statistically compares classification trees grown via exhaustive search for the best splits and grown by conducting curvature tests with interaction.

rng(1) % For reproducibility

Grow a default classification tree using the training set, adultdata, which is a table. The response-variable name is 'salary'.

C1 = fitctree(adultdata,'salary')
C1 =
ClassificationTree
PredictorNames: {1x14 cell}
ResponseName: 'salary'
CategoricalPredictors: [2 4 6 7 8 9 10 14]
ClassNames: [<=50K    >50K]
ScoreTransform: 'none'
NumObservations: 32561

Properties, Methods

C1 is a full ClassificationTree model. Its ResponseName property is 'salary'. C1 uses an exhaustive search to find the best predictor to split on based on maximal splitting gain.

Grow another classification tree using the same data set, but specify to find the best predictor to split using the curvature test with interaction.

C2 = fitctree(adultdata,'salary','PredictorSelection','interaction-curvature')
C2 =
ClassificationTree
PredictorNames: {1x14 cell}
ResponseName: 'salary'
CategoricalPredictors: [2 4 6 7 8 9 10 14]
ClassNames: [<=50K    >50K]
ScoreTransform: 'none'
NumObservations: 32561

Properties, Methods

C2 also is a full ClassificationTree model with ResponseName equal to 'salary'.

Conduct a 5-by-2 paired F test to compare the accuracies of the two models using the training set. Because the response-variable names in the data sets and the ResponseName properties are all equal, and the response data in both sets are equal, you can omit supplying the response data.

h = testckfold(C1,C2,adultdata,adultdata)
h = logical
0

h = 0 indicates to not reject the null hypothesis that C1 and C2 have the same accuracies at the 5% significance level.

Conduct a statistical test comparing the misclassification rates of the two models using a 5-by-2 paired F test.

Create a naive Bayes template and a classification tree template using default options.

C1 = templateNaiveBayes;
C2 = templateTree;

C1 and C2 are template objects corresponding to the naive Bayes and classification tree algorithms, respectively.

Test whether the two models have equal predictive accuracies. Use the same predictor data for each model. testckfold conducts a 5-by-2, two-sided, paired F test by default.

rng(1); % For reproducibility
h = testckfold(C1,C2,meas,meas,species)
h = logical
0

h = 0 indicates to not reject the null hypothesis that the two models have equal predictive accuracies.

Conduct a statistical test to assess whether a simpler model has better accuracy than a more complex model using a 10-by-10 repeated cross-validation t test.

Load Fisher's iris data set. Create a cost matrix that penalizes misclassifying a setosa iris twice as much as misclassifying a virginica iris as a versicolor.

load fisheriris
tabulate(species)
Value    Count   Percent
setosa       50     33.33%
versicolor       50     33.33%
virginica       50     33.33%
Cost = [0 2 2;2 0 1;2 1 0];
ClassNames = {'setosa' 'versicolor' 'virginica'}; % Specifies the order of the rows and columns in Cost

The empirical distribution of the classes is uniform, and the classification cost is slightly imbalanced.

Create two ECOC templates: one that uses linear SVM binary learners and one that uses SVM binary learners equipped with the RBF kernel.

tSVMLinear = templateSVM('Standardize',true); % Linear SVM by default
tSVMRBF = templateSVM('KernelFunction','RBF','Standardize',true);
C1 = templateECOC('Learners',tSVMLinear);
C2 = templateECOC('Learners',tSVMRBF);

C1 and C2 are ECOC template objects. C1 is prepared to train linear SVM binary learners. C2 is prepared to train SVM binary learners with an RBF kernel.

Test the null hypothesis that the simpler model (C1) is at most as accurate as the more complex model (C2) in terms of classification costs. Conduct the 10-by-10 repeated cross-validation test. Request to return p-values and misclassification costs.

rng(1); % For reproducibility
[h,p,e1,e2] = testckfold(C1,C2,meas,meas,species,...
'Alternative','greater','Test','10x10t','Cost',Cost,...
'ClassNames',ClassNames)
h = logical
0

p = 0.1077
e1 = 10×10

0         0         0    0.0667         0    0.0667    0.1333         0    0.1333         0
0.0667    0.0667         0         0         0         0    0.0667         0    0.0667    0.0667
0         0         0         0         0    0.0667    0.0667    0.0667    0.0667    0.0667
0.0667    0.0667         0    0.0667         0    0.0667         0         0    0.0667         0
0.0667    0.0667    0.0667         0    0.0667    0.0667         0         0         0         0
0         0    0.1333         0         0    0.0667         0         0    0.0667    0.0667
0.0667    0.0667         0         0    0.0667         0         0    0.0667         0    0.0667
0.0667         0    0.0667    0.0667         0    0.1333         0    0.0667         0         0
0    0.0667    0.1333    0.0667    0.0667         0         0         0         0         0
0    0.0667    0.0667    0.0667    0.0667         0         0    0.0667         0         0

e2 = 10×10

0         0         0    0.1333         0    0.0667    0.1333         0    0.2667         0
0.0667    0.0667         0    0.1333         0         0         0    0.1333    0.1333    0.0667
0.1333    0.1333         0         0         0    0.0667         0    0.0667    0.0667    0.0667
0    0.1333         0    0.0667    0.1333    0.1333         0         0    0.0667         0
0.0667    0.0667    0.0667         0    0.0667    0.1333    0.1333         0         0    0.0667
0.0667         0    0.0667    0.0667         0    0.0667    0.1333         0    0.0667    0.0667
0.2000    0.0667         0         0    0.0667         0         0    0.1333         0    0.0667
0.2000         0         0    0.1333         0    0.1333         0    0.0667         0         0
0    0.0667    0.0667    0.0667    0.1333         0    0.2000         0         0         0
0.0667    0.0667         0    0.0667    0.1333         0         0    0.0667    0.1333    0.0667

The p-value is slightly greater than 0.10, which indicates to retain the null hypothesis that the simpler model is at most as accurate as the more complex model. This result is consistent for any significance level (Alpha) that is at most 0.10.

e1 and e2 are 10-by-10 matrices containing misclassification costs. Row r corresponds to run r of the repeated cross validation. Column k corresponds to test-set fold k within a particular cross-validation run. For example, element (2,4) of e2 is 0.1333. This value means that in cross-validation run 2, when the test set is fold 4, the estimated test-set misclassification cost is 0.1333.
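As a rough illustration of how to read these matrices, the following sketch summarizes two small loss matrices. The values are hypothetical (made up for illustration, not output from testckfold), and the variable names are assumptions. Rows are runs and columns are folds, as described above.

```matlab
% Hypothetical 3-by-2 loss matrices: rows are runs, columns are folds.
e1 = [0.10 0.20; 0.00 0.10; 0.20 0.10];
e2 = [0.20 0.20; 0.10 0.30; 0.20 0.00];

lossRun2Fold2 = e2(2,2);                  % loss of model 2 in run 2, test fold 2
runMean1 = mean(e1,2);                    % average loss of model 1 within each run
runMean2 = mean(e2,2);                    % average loss of model 2 within each run
overallDiff = mean(e1(:)) - mean(e2(:));  % a negative value favors model 1
```

A summary like this is descriptive only; the test decision h and the p-value come from the test statistic, not from the raw averages.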

Reduce classification model complexity by selecting a subset of predictor variables (features) from a larger set. Then, statistically compare the accuracy between the two models.

Train an ensemble of 100 boosted classification trees using AdaBoostM1 and the entire set of predictors. Inspect the importance measure for each predictor.

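t = templateTree('MaxNumSplits',1); % Weak-learner template tree object
C = fitcensemble(X,Y,'Method','AdaBoostM1','Learners',t); % X and Y are the full predictor data and class labels
predImp = predictorImportance(C);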

bar(predImp)
h = gca;
h.XTick = 1:2:h.XLim(2);
title('Predictor Importances')
xlabel('Predictor')
ylabel('Importance measure')

Identify the top five predictors in terms of their importance.

[~,idxSort] = sort(predImp,'descend');
idx5 = idxSort(1:5);

Test whether the two models have equal predictive accuracies. Specify the reduced data set and then the full predictor data. Use parallel computing to speed up computations.

s = RandStream('mlfg6331_64');
Options = statset('UseParallel',true,'Streams',s,'UseSubstreams',true);

[h,p,e1,e2] = testckfold(C,C,X(:,idx5),X,Y,'Options',Options)
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).
h = logical
0

p = 0.4161
e1 = 5×2

0.0686    0.0795
0.0800    0.0625
0.0914    0.0568
0.0400    0.0739
0.0914    0.0966

e2 = 5×2

0.0914    0.0625
0.1257    0.0682
0.0971    0.0625
0.0800    0.0909
0.0914    0.1193

testckfold treats trained classification models as templates, and so it ignores all fitted parameters in C. That is, testckfold cross validates C using only the specified options and the predictor data to estimate the out-of-fold classification losses.

h = 0 indicates to not reject the null hypothesis that the two models have equal predictive accuracies. This result favors the simpler ensemble.

## Input Arguments

Classification model template or trained classification model, specified as any classification model template object or trained classification model object described in these tables.

| Template Type | Returned By |
| --- | --- |
| Classification tree | templateTree |
| Discriminant analysis | templateDiscriminant |
| Ensemble (boosting, bagging, and random subspace) | templateEnsemble |
| Error-correcting output codes (ECOC), multiclass classification model | templateECOC |
| kNN | templateKNN |
| Naive Bayes | templateNaiveBayes |
| Support Vector Machine (SVM) | templateSVM |

| Trained Model Type | Model Object | Returned By |
| --- | --- | --- |
| Classification tree | ClassificationTree | fitctree |
| Discriminant analysis | ClassificationDiscriminant | fitcdiscr |
| Ensemble of bagged classification models | ClassificationBaggedEnsemble | fitcensemble |
| Ensemble of classification models | ClassificationEnsemble | fitcensemble |
| ECOC model | ClassificationECOC | fitcecoc |
| kNN | ClassificationKNN | fitcknn |
| Naive Bayes | ClassificationNaiveBayes | fitcnb |
| Neural network | ClassificationNeuralNetwork (with observations in rows) | fitcnet |
| SVM | ClassificationSVM | fitcsvm |

For efficiency, supply a classification model template object instead of a trained classification model object.

Classification model template or trained classification model, specified as any classification model template object or trained classification model object described in these tables.

| Template Type | Returned By |
| --- | --- |
| Classification tree | templateTree |
| Discriminant analysis | templateDiscriminant |
| Ensemble (boosting, bagging, and random subspace) | templateEnsemble |
| Error-correcting output codes (ECOC), multiclass classification model | templateECOC |
| kNN | templateKNN |
| Naive Bayes | templateNaiveBayes |
| Support Vector Machine (SVM) | templateSVM |

| Trained Model Type | Model Object | Returned By |
| --- | --- | --- |
| Classification tree | ClassificationTree | fitctree |
| Discriminant analysis | ClassificationDiscriminant | fitcdiscr |
| Ensemble of bagged classification models | ClassificationBaggedEnsemble | fitcensemble |
| Ensemble of classification models | ClassificationEnsemble | fitcensemble |
| ECOC model | ClassificationECOC | fitcecoc |
| kNN | ClassificationKNN | fitcknn |
| Naive Bayes | ClassificationNaiveBayes | fitcnb |
| Neural network | ClassificationNeuralNetwork (with observations in rows) | fitcnet |
| SVM | ClassificationSVM | fitcsvm |

For efficiency, supply a classification model template object instead of a trained classification model object.

Data used to apply to the first full classification model or template, C1, specified as a numeric matrix or table.

Each row of X1 corresponds to one observation, and each column corresponds to one variable. testckfold does not support multicolumn variables and cell arrays other than cell arrays of character vectors.

X1 and X2 must be of the same data type, and X1, X2, and Y must have the same number of observations.

If you specify Y as an array, then testckfold treats all columns of X1 as separate predictor variables.

Data Types: double | single | table

Data used to apply to the second full classification model or template, C2, specified as a numeric matrix or table.

Each row of X2 corresponds to one observation, and each column corresponds to one variable. testckfold does not support multicolumn variables and cell arrays other than cell arrays of character vectors.

X1 and X2 must be of the same data type, and X1, X2, and Y must have the same number of observations.

If you specify Y as an array, then testckfold treats all columns of X2 as separate predictor variables.

Data Types: double | single | table

True class labels, specified as a categorical, character, or string array, a logical or numeric vector, a cell array of character vectors, or a character vector or string scalar.

• For a character vector or string scalar, X1 and X2 must be tables, their response variables must have the same name and values, and Y must be the common variable name. For example, if X1.Labels and X2.Labels are the response variables, then Y is 'Labels' and X1.Labels and X2.Labels must be equivalent.

• For all other supported data types, Y is an array of true class labels.

• If Y is a character array, then each element must correspond to one row of the array.

• X1, X2, and Y must have the same number of observations (rows).

• If both of these statements are true, then you can omit supplying Y:

    • X1 and X2 are tables containing the same response variable (values and name).

    • C1 and C2 are full classification models containing ResponseName properties specifying the response variable names in X1 and X2.

Consequently, testckfold uses the common response variable in the tables. For example, if the response variables in the tables are X1.Labels and X2.Labels, and the values of C1.ResponseName and C2.ResponseName are 'Labels', then you do not have to supply Y.
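The following sketch shows the case where Y is the name of the response variable in a table. It assumes Fisher's iris data (fisheriris, which provides meas and species); the table tbl and its variable names are hypothetical choices made for this illustration.

```matlab
load fisheriris                         % provides meas (predictors) and species (labels)

% Build one table that holds the predictors and the response variable.
tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'});
tbl.Species = species;                  % response variable named 'Species'

C1 = templateTree;
C2 = templateNaiveBayes;

rng(1)                                  % for reproducibility
% Y is the common response variable name in both tables.
h = testckfold(C1,C2,tbl,tbl,'Species');
```

Because C1 and C2 are templates rather than full models with a ResponseName property, Y cannot be omitted here.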

Data Types: categorical | char | string | logical | single | double | cell

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'Alternative','greater','Test','10x10t','Options',statset('UseParallel',true) specifies to test whether the first set of predicted class labels is more accurate than the second set, to conduct the 10-by-10 t test, and to use parallel computing for cross-validation.

Hypothesis test significance level, specified as the comma-separated pair consisting of 'Alpha' and a scalar value in the interval (0,1).

Example: 'Alpha',0.1

Data Types: single | double

Alternative hypothesis to assess, specified as the comma-separated pair consisting of 'Alternative' and one of the values listed in the table.

| Value | Alternative Hypothesis Description | Supported Tests |
| --- | --- | --- |
| 'unequal' (default) | For predicting Y, the set of predictions resulting from C1 applied to X1 and C2 applied to X2 have unequal accuracies. | '5x2F', '5x2t', and '10x10t' |
| 'greater' | For predicting Y, the set of predictions resulting from C1 applied to X1 is more accurate than C2 applied to X2. | '5x2t' and '10x10t' |
| 'less' | For predicting Y, the set of predictions resulting from C1 applied to X1 is less accurate than C2 applied to X2. | '5x2t' and '10x10t' |

For details on supported tests, see Test.

Example: 'Alternative','greater'
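A one-sided alternative has to be paired with one of the t tests, because the default '5x2F' test supports 'unequal' only. The following is a minimal sketch of such a call, assuming Fisher's iris data and two default templates chosen only for illustration.

```matlab
load fisheriris                 % provides meas (predictors) and species (labels)

C1 = templateNaiveBayes;
C2 = templateTree;

rng(1)                          % for reproducibility
% Alternative hypothesis: C1 is less accurate than C2.
% 'less' is not supported by the default '5x2F' test, so request '5x2t'.
[h,p] = testckfold(C1,C2,meas,meas,species, ...
    'Alternative','less','Test','5x2t');
```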

Flag identifying categorical predictors in the first test-set predictor data (X1), specified as the comma-separated pair consisting of 'X1CategoricalPredictors' and one of the following:

• A numeric vector with indices from 1 through p, where p is the number of columns of X1.

• A logical vector of length p, where a true entry means that the corresponding column of X1 is a categorical variable.

• 'all', meaning all predictors are categorical.

Specification of X1CategoricalPredictors is appropriate if:

• At least one predictor is categorical and C1 is a classification tree, an ensemble of classification trees, an ECOC model, or a naive Bayes classification model.

• All predictors are categorical and C1 is a kNN classification model.

If you specify X1CategoricalPredictors for any other case, then testckfold throws an error. For example, the function cannot train SVM learners using categorical predictors.

The default is [], which indicates that there are no categorical predictors.

Example: 'X1CategoricalPredictors','all'

Data Types: single | double | logical | char | string

Flag identifying categorical predictors in the second test-set predictor data (X2), specified as the comma-separated pair consisting of 'X2CategoricalPredictors' and one of the following:

• A numeric vector with indices from 1 through p, where p is the number of columns of X2.

• A logical vector of length p, where a true entry means that the corresponding column of X2 is a categorical variable.

• 'all', meaning all predictors are categorical.

Specification of X2CategoricalPredictors is appropriate if:

• At least one predictor is categorical and C2 is a classification tree, an ensemble of classification trees, an ECOC model, or a naive Bayes classification model.

• All predictors are categorical and C2 is a kNN classification model.

If you specify X2CategoricalPredictors for any other case, then testckfold throws an error. For example, the function cannot train SVM learners using categorical predictors.

The default is [], which indicates that there are no categorical predictors.

Example: 'X2CategoricalPredictors','all'

Data Types: single | double | logical | char | string

Class names, specified as the comma-separated pair consisting of 'ClassNames' and a categorical, character, or string array, logical or numeric vector, or cell array of character vectors. You must set ClassNames using the data type of Y.

If ClassNames is a character array, then each element must correspond to one row of the array.

Use ClassNames to:

• Specify the order of any input argument dimension that corresponds to class order. For example, use ClassNames to specify the order of the dimensions of Cost.

• Select a subset of classes for testing. For example, suppose that the set of all distinct class names in Y is {'a','b','c'}. To train and test models using observations from classes 'a' and 'c' only, specify 'ClassNames',{'a','c'}.

The default is the set of all distinct class names in Y.

Example: 'ClassNames',{'b','g'}

Data Types: single | double | logical | char | string | cell | categorical
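As a sketch of the second use, the call below restricts the comparison to two of the three iris classes. The data set (fisheriris) and the two default templates are assumptions made for this illustration.

```matlab
load fisheriris                 % species contains setosa, versicolor, and virginica

C1 = templateTree;
C2 = templateNaiveBayes;

rng(1)                          % for reproducibility
% Train and test using observations from two classes only.
% ClassNames uses the same data type as species (cell array of character vectors).
h = testckfold(C1,C2,meas,meas,species, ...
    'ClassNames',{'versicolor','virginica'});
```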

Classification cost, specified as the comma-separated pair consisting of 'Cost' and a square matrix or structure array.

• If you specify the square matrix Cost, then Cost(i,j) is the cost of classifying a point into class j if its true class is i. That is, the rows correspond to the true class and the columns correspond to the predicted class. To specify the class order for the corresponding rows and columns of Cost, additionally specify the ClassNames name-value pair argument.

• If you specify the structure S, then S must have two fields:

• S.ClassNames, which contains the class names as a variable of the same data type as Y. You can use this field to specify the order of the classes.

• S.ClassificationCosts, which contains the cost matrix, with rows and columns ordered as in S.ClassNames

For cost-sensitive testing, use testcholdout.

It is a best practice to supply the same cost matrix used to train the classification models.

The default is Cost(i,j) = 1 if i ~= j, and Cost(i,j) = 0 if i = j.

Example: 'Cost',[0 1 2 ; 1 0 2; 2 2 0]

Data Types: double | single | struct
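The structure form bundles the class order with the cost matrix, so a separate ClassNames argument is not needed to fix the row and column order. The sketch below reuses the cost matrix and ECOC templates from the iris example on this page; the variable name S follows the description above.

```matlab
load fisheriris                 % provides meas (predictors) and species (labels)

% Cost structure: the class order and the matching cost matrix travel together.
S.ClassNames = {'setosa','versicolor','virginica'};   % same data type as species
S.ClassificationCosts = [0 2 2; 2 0 1; 2 1 0];        % rows: true class, columns: predicted class

C1 = templateECOC('Learners',templateSVM('Standardize',true));
C2 = templateECOC('Learners',templateSVM('KernelFunction','RBF','Standardize',true));

rng(1)                          % for reproducibility
h = testckfold(C1,C2,meas,meas,species,'Cost',S);
```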

Loss function, specified as the comma-separated pair consisting of 'LossFun' and 'classiferror', 'binodeviance', 'exponential', 'hinge', or a function handle.

• The following table lists the available loss functions.

| Value | Loss Function |
| --- | --- |
| 'binodeviance' | Binomial deviance |
| 'classiferror' | Classification error |
| 'exponential' | Exponential loss |
| 'hinge' | Hinge loss |

• Specify your own function using function handle notation.

Suppose that n = size(X,1) is the sample size and there are K unique classes. Your function must have the signature lossvalue = lossfun(C,S,W,Cost), where:

• The output argument lossvalue is a scalar.

• lossfun is the name of your function.

• C is an n-by-K logical matrix with rows indicating which class the corresponding observation belongs to. The column order corresponds to the class order in the ClassNames name-value pair argument.

Construct C by setting C(p,q) = 1 if observation p is in class q, for each row. Set all other elements of row p to 0.

• S is an n-by-K numeric matrix of classification scores. The column order corresponds to the class order in the ClassNames name-value pair argument.

• W is an n-by-1 numeric vector of observation weights. If you pass W, the software normalizes the weights to sum to 1.

• Cost is a K-by-K numeric matrix of classification costs. For example, Cost = ones(K) - eye(K) specifies a cost of 0 for correct classification and a cost of 1 for misclassification.
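To make the signature concrete, here is a minimal sketch of a custom loss: the weighted fraction of observations whose true class does not receive the highest score. The function name lossfun and the small input matrices are hypothetical and exist only to exercise the handle outside of testckfold.

```matlab
% Custom loss with the required signature lossfun(C,S,W,Cost).
% An observation counts as an error when the score of its true class is
% below the row maximum. Cost is accepted to match the signature but is unused.
lossfun = @(C,S,W,Cost) sum(W .* (sum(S.*C,2) < max(S,[],2))) / sum(W);

% Hypothetical inputs: n = 3 observations, K = 2 classes.
C = logical([1 0; 0 1; 1 0]);        % true class indicators, one true entry per row
S = [0.9 0.1; 0.6 0.4; 0.2 0.8];     % classification scores
W = [1; 1; 1]/3;                     % observation weights that sum to 1
Cost = ones(2) - eye(2);             % 0 for correct, 1 for incorrect classification

lossvalue = lossfun(C,S,W,Cost);     % observations 2 and 3 are misclassified
```

To use a handle like this in a test, pass it as 'LossFun',lossfun in the testckfold call.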

Parallel computing options, specified as the comma-separated pair consisting of 'Options' and a structure array returned by statset. These options require Parallel Computing Toolbox. testckfold uses the 'Streams', 'UseParallel', and 'UseSubstreams' fields.

This table summarizes the available options.

| Option | Description |
| --- | --- |
| 'Streams' | A RandStream object or cell array of such objects. If you do not specify Streams, the software uses the default stream or streams. If you specify Streams, use a single object except when all of the following are true: you have an open parallel pool, UseParallel is true, and UseSubstreams is false. In that case, use a cell array of the same size as the parallel pool. If a parallel pool is not open, then the software tries to open one (depending on your preferences), and Streams must supply a single random number stream. |
| 'UseParallel' | If you have Parallel Computing Toolbox, then you can invoke a pool of workers by setting 'UseParallel',true. |
| 'UseSubstreams' | Set to true to compute in parallel using the stream specified by 'Streams'. Default is false. For example, set Streams to a type allowing substreams, such as 'mlfg6331_64' or 'mrg32k3a'. |

Example: 'Options',statset('UseParallel',true)

Data Types: struct

Prior probabilities for each class, specified as the comma-separated pair consisting of 'Prior' and 'empirical', 'uniform', a numeric vector, or a structure.

This table summarizes the available options for setting prior probabilities.

| Value | Description |
| --- | --- |
| 'empirical' | The class prior probabilities are the class relative frequencies in Y. |
| 'uniform' | All class prior probabilities are equal to 1/K, where K is the number of classes. |
| numeric vector | Each element is a class prior probability. Specify the order using the ClassNames name-value pair argument. The software normalizes the elements such that they sum to 1. |
| structure | A structure S with two fields: S.ClassNames contains the class names as a variable of the same type as Y, and S.ClassProbs contains a vector of corresponding prior probabilities. The software normalizes the elements such that they sum to 1. |

Example: 'Prior',struct('ClassNames',{{'setosa','versicolor'}},'ClassProbs',[1,2])

Data Types: char | string | single | double | struct

Test to conduct, specified as the comma-separated pair consisting of 'Test' and one of the following: '5x2F', '5x2t', '10x10t'.

| Value | Description | Supported Alternative Hypothesis |
| --- | --- | --- |
| '5x2F' (default) | 5-by-2 paired F test. Appropriate for two-sided testing only. | 'unequal' |
| '5x2t' | 5-by-2 paired t test | 'unequal', 'less', 'greater' |
| '10x10t' | 10-by-10 repeated cross-validation t test | 'unequal', 'less', 'greater' |

For details on the available tests, see Repeated Cross-Validation Tests. For details on supported alternative hypotheses, see Alternative.

Example: 'Test','10x10t'

Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and 0, 1, or 2. Verbose controls the amount of diagnostic information that the software displays in the Command Window during training of each cross-validation fold.

This table summarizes the available verbosity level options.

| Value | Description |
| --- | --- |
| 0 | The software does not display diagnostic information. |
| 1 | The software displays diagnostic messages every time it implements a new cross-validation run. |
| 2 | The software displays diagnostic messages every time it implements a new cross-validation run, and every time it trains on a particular fold. |

Example: 'Verbose',1

Data Types: double | single

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a numeric vector.

The size of Weights must equal the number of rows of X1. The software weighs the observations in each row of X1 and X2 with the corresponding weight in Weights.

The software normalizes Weights to sum up to the value of the prior probability in the respective class.

Data Types: double | single
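A minimal sketch of supplying observation weights follows, assuming Fisher's iris data and two default templates. The weighting scheme (doubling the weight of virginica observations) is a hypothetical choice made only for illustration.

```matlab
load fisheriris                          % provides meas (predictors) and species (labels)

% One weight per observation (row of meas); virginica rows count double.
w = ones(size(meas,1),1);
w(strcmp(species,'virginica')) = 2;

rng(1)                                   % for reproducibility
h = testckfold(templateTree,templateNaiveBayes,meas,meas,species,'Weights',w);
```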

Notes:

• testckfold treats trained classification models as templates. Therefore, it ignores all fitted parameters in the model. That is, testckfold cross-validates using only the options specified in the model and the predictor data.

• The repeated cross-validation tests depend on the assumption that the test statistics are asymptotically normal under the null hypothesis. Highly imbalanced cost matrices (for example, Cost = [0 100;1 0]) and highly discrete response distributions (that is, most of the observations are in a small number of classes) might violate the asymptotic normality assumption. For cost-sensitive testing, use testcholdout.

• NaNs, <undefined> values, empty character vectors (''), empty strings (""), and <missing> values indicate missing data values.

## Output Arguments

Hypothesis test result, returned as a logical value.

h = 1 indicates the rejection of the null hypothesis at the Alpha significance level.

h = 0 indicates failure to reject the null hypothesis at the Alpha significance level.

Data Types: logical

p-value of the test, returned as a scalar in the interval [0,1]. p is the probability that a random test statistic is at least as extreme as the observed test statistic, given that the null hypothesis is true.

testckfold estimates p using the distribution of the test statistic, which varies with the type of test. For details on test statistics, see Repeated Cross-Validation Tests.

Classification losses, returned as a numeric matrix. The rows of e1 correspond to the cross-validation run and the columns correspond to the test fold.

testckfold applies the first test-set predictor data (X1) to the first classification model (C1) to estimate the first set of class labels.

e1 summarizes the accuracy of the first set of class labels predicting the true class labels (Y) for each cross-validation run and fold. The meaning of the elements of e1 depends on the type of classification loss.

Classification losses, returned as a numeric matrix. The rows of e2 correspond to the cross-validation run and the columns correspond to the test fold.

testckfold applies the second test-set predictor data (X2) to the second classification model (C2) to estimate the second set of class labels.

e2 summarizes the accuracy of the second set of class labels predicting the true class labels (Y) for each cross-validation run and fold. The meaning of the elements of e2 depends on the type of classification loss.

### Repeated Cross-Validation Tests

Repeated cross-validation tests form the test statistic for comparing the accuracies of two classification models by combining the classification loss differences resulting from repeatedly cross-validating the data. Repeated cross-validation tests are useful when sample size is limited.

To conduct an R-by-K test:

1. Randomly divide (stratified by class) the predictor data sets and true class labels into K sets, R times. Each division is called a run and each set within a run is called a fold. Each run contains the complete, but divided, data sets.

2. For runs r = 1 through R, repeat these steps for k = 1 through K:

1. Reserve fold k as a test set, and train the two classification models using their respective predictor data sets on the remaining K – 1 folds.

2. Predict class labels using the trained models and their respective fold k predictor data sets.

3. Estimate the classification loss by comparing the two sets of estimated labels to the true labels. Denote $e_{crk}$ as the classification loss when the test set is fold k in run r of classification model c.

4. Compute the difference between the classification losses of the two models:

$\hat{\delta}_{rk} = e_{1rk} - e_{2rk}.$

At the end of a run, there are K classification losses per classification model.

3. Combine the results of step 2:

• For each run r = 1 through R, estimate the within-fold average of the differences: $\overline{\delta}_{r} = \frac{1}{K}\sum_{k=1}^{K}\hat{\delta}_{rk}.$

• Estimate the overall average of the differences: $\overline{\delta} = \frac{1}{KR}\sum_{r=1}^{R}\sum_{k=1}^{K}\hat{\delta}_{rk}.$

• For each run r = 1 through R, estimate the within-fold variance of the differences: $s_{r}^{2} = \frac{1}{K}\sum_{k=1}^{K}\left(\hat{\delta}_{rk} - \overline{\delta}_{r}\right)^{2}.$

• Estimate the average of the within-fold variances: $\overline{s}^{2} = \frac{1}{R}\sum_{r=1}^{R}s_{r}^{2}.$

• Estimate the overall sample variance of the differences: $S^{2} = \frac{1}{KR-1}\sum_{r=1}^{R}\sum_{k=1}^{K}\left(\hat{\delta}_{rk} - \overline{\delta}\right)^{2}.$
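The summary quantities in this step can be sketched as follows, using the 5-by-2 loss matrices e1 and e2 shown in the feature-selection example on this page as input. This illustrates the formulas as written in this section; it is not claimed to reproduce the internal computation of testckfold, and the variable names are assumptions.

```matlab
% Loss matrices copied from the feature-selection example (rows = runs, columns = folds).
e1 = [0.0686 0.0795; 0.0800 0.0625; 0.0914 0.0568; 0.0400 0.0739; 0.0914 0.0966];
e2 = [0.0914 0.0625; 0.1257 0.0682; 0.0971 0.0625; 0.0800 0.0909; 0.0914 0.1193];

d = e1 - e2;                         % R-by-K matrix of loss differences
[R,K] = size(d);                     % R = 5 runs, K = 2 folds

dBarR = mean(d,2);                   % average difference within each run
dBar  = mean(d(:));                  % overall average difference
s2R   = mean((d - dBarR).^2,2);      % variance within each run, 1/K normalization
                                     % (implicit expansion requires R2016b or later)
sBar2 = mean(s2R);                   % average of the per-run variances
S2    = var(d(:));                   % overall sample variance, 1/(KR-1) normalization
```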

Compute the test statistic. All supported tests described here assume that, under H0, the estimated differences are independent and approximately normally distributed, with mean 0 and a finite, common standard deviation. However, these tests violate the independence assumption, and so the test-statistic distributions are approximate.

• For K = 2, the test is a paired test (for example, a 5-by-2 test uses R = 5 runs of K = 2 folds). The two supported tests are a paired t test and a paired F test.

• The test statistic for the paired t test is

${t}_{paired}^{\ast }=\frac{{\stackrel{^}{\delta }}_{11}}{\sqrt{{\overline{s}}^{2}}}.$

${t}_{paired}^{\ast }$ has a t-distribution with R degrees of freedom under the null hypothesis.

To reduce the effects of correlation between the estimated differences, the quantity ${\stackrel{^}{\delta }}_{11}$ occupies the numerator rather than $\overline{\delta }$.

5-by-2 paired t tests can be slightly conservative [4].

• The test statistic for the paired F test is

${F}_{paired}^{\ast }=\frac{\frac{1}{RK}\sum _{r=1}^{R}\sum _{k=1}^{K}{\left({\stackrel{^}{\delta }}_{rk}\right)}^{2}}{{\overline{s}}^{2}}.$

${F}_{paired}^{\ast }$ has an F distribution with RK and R degrees of freedom under the null hypothesis.

A 5-by-2 paired F test has comparable power to the 5-by-2 t test, but is more conservative [1].

• For K > 2, the test is a repeated cross-validation test. The test statistic is

${t}_{CV}^{\ast }=\frac{\overline{\delta }}{S/\sqrt{\nu +1}}.$

${t}_{CV}^{\ast }$ has a t distribution with ν degrees of freedom. If the differences were truly independent, then ν = RK – 1. Because they are not independent, the degrees of freedom parameter must be optimized.

For a 10-by-10 repeated cross-validation t test, the optimal degrees of freedom is between 8 and 11 ([2] and [3]). testckfold uses ν = 10.

The advantage of repeated cross-validation tests over paired tests is that the results are more repeatable [3]. The disadvantage is that they require high computational resources.
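To make the three statistics concrete, here is a minimal Python sketch that evaluates them from a nested list of loss differences (one inner list per run). The function names `paired_statistics` and `repeated_cv_statistic` are hypothetical, the code only reproduces the arithmetic of the formulas above, and it does not compute p-values or mirror the testckfold implementation.

```python
import math


def paired_statistics(delta):
    """Return (t_paired, F_paired) for differences with two folds per run."""
    R = len(delta)
    K = len(delta[0])

    # Within-run variances with the 1/K normalization, then their average.
    run_vars = []
    for row in delta:
        run_mean = sum(row) / K
        run_vars.append(sum((d - run_mean) ** 2 for d in row) / K)
    mean_var = sum(run_vars) / R

    # Paired t: only the first difference appears in the numerator.
    t_paired = delta[0][0] / math.sqrt(mean_var)

    # Paired F: mean of squared differences over the average variance.
    mean_sq = sum(d * d for row in delta for d in row) / (R * K)
    f_paired = mean_sq / mean_var
    return t_paired, f_paired


def repeated_cv_statistic(delta, nu=10):
    """Return t_CV, using nu degrees of freedom in the scaling term."""
    flat = [d for row in delta for d in row]
    n = len(flat)
    overall_mean = sum(flat) / n

    # Overall sample standard deviation S, with n - 1 in the denominator.
    s = math.sqrt(sum((d - overall_mean) ** 2 for d in flat) / (n - 1))
    return overall_mean / (s / math.sqrt(nu + 1))
```

Both functions divide by a variance estimate, so they raise `ZeroDivisionError` if all differences within every run are identical. A production implementation would need to handle that case explicitly.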

### Classification Loss

Classification losses indicate the accuracy of a classification model or set of predicted labels. In general, for a fixed cost matrix, classification accuracy decreases as classification loss increases.

testckfold returns the classification losses (see e1 and e2) under the alternative hypothesis (that is, the unrestricted classification losses). In the definitions that follow:

• The classification losses focus on the first classification model. The classification losses for the second model are similar.

• ntest is the test-set sample size.

• I(x) is the indicator function. If x is a true statement, then I(x) = 1. Otherwise, I(x) = 0.

• ${\stackrel{^}{p}}_{1j}$ is the predicted class assignment of classification model 1 for observation j.

• yj is the true class label of observation j.

• Binomial deviance has the form

${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{j}^{\prime }f\left({X}_{j}\right)\right)\right)}{\sum _{j=1}^{{n}_{test}}{w}_{j}}$

where:

• ${y}_{j}^{\prime }$ = 1 for the positive class and -1 for the negative class.

• $f\left({X}_{j}\right)$ is the classification score.

The binomial deviance has connections to the maximization of the binomial likelihood function. For details on binomial deviance, see [5].

• Exponential loss is similar to binomial deviance and has the form

${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}\mathrm{exp}\left(-{y}_{j}^{\prime }f\left({X}_{j}\right)\right)}{\sum _{j=1}^{{n}_{test}}{w}_{j}}.$

${y}_{j}^{\prime }$ and $f\left({X}_{j}\right)$ take the same forms here as in the binomial deviance formula.

• Hinge loss has the form

${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}\mathrm{max}\left\{0,1-{y}_{j}^{\prime }f\left({X}_{j}\right)\right\}}{\sum _{j=1}^{{n}_{test}}{w}_{j}},$

where ${y}_{j}^{\prime }$ and $f\left({X}_{j}\right)$ take the same forms as in the binomial deviance formula.

Hinge loss linearly penalizes misclassified observations and is related to the SVM objective function used for optimization. For more details on hinge loss, see [5].

• Misclassification rate, or classification error, is a scalar in the interval [0,1] representing the proportion of misclassified observations. That is, the misclassification rate for the first classification model is

${e}_{1}=\frac{\sum _{j=1}^{{n}_{test}}{w}_{j}I\left({\stackrel{^}{p}}_{1j}\ne {y}_{j}\right)}{\sum _{j=1}^{{n}_{test}}{w}_{j}}.$
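The four loss definitions share the same weighted-average structure, which the following Python sketch makes explicit. It assumes labels encoded as 1 and -1 for the score-based losses, one score and one weight per test observation, and hypothetical function names. It illustrates the formulas only and is not the toolbox implementation.

```python
import math


def weighted_mean(values, weights):
    """Weighted average: sum(w_j * v_j) / sum(w_j)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def binomial_deviance(y, scores, weights):
    """y holds 1 (positive class) or -1 (negative class)."""
    terms = [math.log(1 + math.exp(-2 * yj * fj)) for yj, fj in zip(y, scores)]
    return weighted_mean(terms, weights)


def exponential_loss(y, scores, weights):
    terms = [math.exp(-yj * fj) for yj, fj in zip(y, scores)]
    return weighted_mean(terms, weights)


def hinge_loss(y, scores, weights):
    terms = [max(0.0, 1 - yj * fj) for yj, fj in zip(y, scores)]
    return weighted_mean(terms, weights)


def misclassification_rate(true_labels, predicted_labels, weights):
    """Weighted proportion of observations whose prediction is wrong."""
    terms = [
        1.0 if p != t else 0.0
        for t, p in zip(true_labels, predicted_labels)
    ]
    return weighted_mean(terms, weights)
```

With equal weights, `misclassification_rate` reduces to the plain fraction of misclassified test observations, and a score of 0 for every observation gives a binomial deviance of log(2) regardless of the weights.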

## Tips

• Examples of ways to compare models include:

• Compare the accuracies of a simple classification model and a more complex model by passing the same set of predictor data.

• Compare the accuracies of two different models using two different sets of predictors.

• Perform various types of Feature Selection. For example, you can compare the accuracy of a model trained using a set of predictors to the accuracy of one trained on a subset or different set of predictors. You can arbitrarily choose the set of predictors, or use a feature selection technique like PCA or sequential feature selection (see pca and sequentialfs).

• If both of these statements are true, then you can omit supplying Y.

• X1 and X2 are tables containing the response variable and use the same response variable name.

• C1 and C2 are full classification models containing equal ResponseName properties (for example, strcmp(C1.ResponseName,C2.ResponseName) returns 1).

Consequently, testckfold uses the common response variable in the tables.

• One way to perform cost-insensitive feature selection is:

1. Create a classification model template that characterizes the first classification model (C1).

2. Create a classification model template that characterizes the second classification model (C2).

3. Specify two predictor data sets. For example, specify X1 as the full predictor set and X2 as a reduced set.

4. Enter testckfold(C1,C2,X1,X2,Y,'Alternative','less'). If testckfold returns 1, then there is enough evidence to suggest that the classification model that uses fewer predictors performs better than the model that uses the full predictor set.

Alternatively, you can assess whether there is a significant difference between the accuracies of the two models. To perform this assessment, remove the 'Alternative','less' specification in step 4. testckfold then conducts a two-sided test, and h = 0 indicates that there is not enough evidence to suggest a difference in the accuracy of the two models.

• The tests are appropriate for the misclassification rate classification loss, but you can specify other loss functions (see LossFun). The key assumptions are that the estimated differences in classification loss are independent and normally distributed with mean 0 and finite common variance under the two-sided null hypothesis. Classification losses other than the misclassification rate can violate these assumptions.

• Highly discrete data, imbalanced classes, and highly imbalanced cost matrices can violate the normality assumption of classification loss differences.

## Algorithms

If you specify to conduct the 10-by-10 repeated cross-validation t test using 'Test','10x10t', then testckfold uses 10 degrees of freedom for the t distribution to find the critical region and estimate the p-value. For more details, see [2] and [3].

## Alternatives

Use testcholdout:

• For test sets with larger sample sizes

• To implement variants of the McNemar test to compare two classification model accuracies

• For cost-sensitive testing using a chi-square or likelihood ratio test. The chi-square test uses quadprog (Optimization Toolbox), which requires an Optimization Toolbox™ license.

## References

[1] Alpaydin, E. “Combined 5 x 2 CV F Test for Comparing Supervised Classification Learning Algorithms.” Neural Computation, Vol. 11, No. 8, 1999, pp. 1885–1892.

[2] Bouckaert, R. “Choosing Between Two Learning Algorithms Based on Calibrated Tests.” International Conference on Machine Learning, 2003, pp. 51–58.

[3] Bouckaert, R., and E. Frank. “Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms.” Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, 2004, pp. 3–12.

[4] Dietterich, T. “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.” Neural Computation, Vol. 10, No. 7, 1998, pp. 1895–1923.

[5] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd Ed. New York: Springer, 2008.