
Select Data for Classification or Open Saved App Session

When you first launch the Classification Learner app, you can choose to import data or to open a previously saved app session. To import data, see Select Data from Workspace and Import Data from File. To open a saved session, see Save and Open App Session.

Select Data from Workspace

Tip

In Classification Learner, tables are the easiest way to use your data, because they can contain numeric and label data. Use the Import Tool to bring your data into the MATLAB® workspace as a table, or use the table functions to create a table from workspace variables. See Tables.
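For example, here is a minimal sketch of creating such a table from workspace variables, using the fisheriris.mat data that ships with Statistics and Machine Learning Toolbox (the predictor names chosen here are illustrative only):

load fisheriris                               % loads meas (150-by-4 double) and species (150-by-1 cell array)
fishertable = array2table(meas, ...
    "VariableNames", {'SepalLength','SepalWidth','PetalLength','PetalWidth'});
fishertable.Species = categorical(species);   % append the class labels as the response variable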

  1. Load your data into the MATLAB workspace.

    Predictor and response variables can be numeric, categorical, string, or logical vectors, cell arrays of character vectors, or character arrays. The response variable cannot contain more than 500 unique class labels. Note: If your response variable is a string vector, then the predictions of the trained model form a cell array of character vectors.

    Combine the predictor data into one variable, either a table or a matrix. You can additionally combine your predictor data and response variable, or you can keep them separate.

    For example data sets, see Example Data for Classification.

  2. On the Apps tab, click Classification Learner.

  3. On the Learn tab, in the File section, click New Session > From Workspace.

  4. In the New Session from Workspace dialog box, under Data Set Variable, select a table or matrix from the list of workspace variables.

    If you select a matrix, choose whether to use rows or columns for observations by clicking the option buttons.

  5. Under Response, observe the default response variable. The app tries to select a suitable response variable from the data set variable and treats all other variables as predictors.

    If you want to use a different response variable, you can:

    • Use the list to select another variable from the data set variable.

    • Select a separate workspace variable by clicking the From workspace option button and then selecting a variable from the list.

  6. Under Predictors, add or remove predictors using the check boxes, or add or remove all predictors by clicking Add All or Remove All. To add or remove several predictors at once, select them in the table; the buttons change to Add N and Remove N, where N is the number of selected predictors.

  7. To accept the default validation scheme and continue, click Start Session. The default validation option is 5-fold cross-validation, which protects against overfitting.

    Tip

    If you have a large data set, you might want to switch to holdout validation. To learn more, see Choose Validation Scheme.

Note

If you prefer loading data into the app directly from the command line, you can specify the predictor data, response variable, and validation type to use in Classification Learner in the command-line call to classificationLearner. For more information, see Classification Learner.
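For example, here is a minimal sketch of launching the app with a table and a response variable name, assuming the classificationLearner(Tbl,ResponseVarName) syntax described on that reference page (the validation name-value options are also documented there):

% Launch Classification Learner with predictor data and a response variable.
fishertable = readtable("fisheriris.csv");      % shipped example file
classificationLearner(fishertable, "Species")   % opens the app with Species as the response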

For next steps, see Train Classification Models in Classification Learner App.

Import Data from File

  1. On the Learn tab, in the File section, select New Session > From File.

  2. Select a file type in the list, such as spreadsheet files, text files, or comma-separated values (.csv) files, or select All Files to browse for other file types such as .dat.

Example Data for Classification

To get started using Classification Learner, try the following example data sets.

Fisher Iris

Number of predictors: 4
Number of observations: 150
Number of classes: 3
Response: Species

Measurements from three species of iris. Try to classify the species.

For a step-by-step example, see Train Decision Trees Using Classification Learner App.

Create a table from the .csv file:

fishertable = readtable("fisheriris.csv");

Credit Rating

Number of predictors: 6
Number of observations: 3932
Number of classes: 7
Response: Rating

Financial ratios and industry sectors information for a list of corporate customers. The response variable consists of credit ratings (AAA, AA, A, BBB, BB, B, CCC) assigned by a rating agency.

Create a table from the CreditRating_Historical.dat file:

openExample("CreditRating_Historical.dat");

Cars

Number of predictors: 7
Number of observations: 100
Number of classes: 7
Response: Origin

Measurements of cars from 1970, 1976, and 1982. Try to classify the country of origin.

Create a table from variables in the carsmall.mat file:

load carsmall
cartable = table(Acceleration, Cylinders, Displacement, ...
    Horsepower, Model_Year, MPG, Weight, Origin);

Arrhythmia

Number of predictors: 279
Number of observations: 452
Number of classes: 16
Response: Class (Y)

Patient information and response variables that indicate the presence or absence of cardiac arrhythmia. Misclassifying a patient who has arrhythmia as "normal" (a false negative) has more severe consequences than misclassifying a healthy patient as "has arrhythmia" (a false positive).

Create a table from the .mat file:

openExample("arrhythmia.mat")
load arrhythmia.mat
Arrhythmia = array2table(X);
Arrhythmia.Class = categorical(Y);

Ovarian Cancer

Number of predictors: 4000
Number of observations: 216
Number of classes: 2
Response: Group

Ovarian cancer data generated using the WCX2 protein array. Includes 95 controls and 121 ovarian cancers.

Create a table from the .mat file:

openExample("ovariancancer.mat")
load ovariancancer.mat
ovariancancer = array2table(obs);
ovariancancer.Group = categorical(grp);

Ionosphere

Number of predictors: 34
Number of observations: 351
Number of classes: 2
Response: Group (Y)

Signals from a phased array of 16 high-frequency antennas. Good (“g”) returned radar signals are those showing evidence of some type of structure in the ionosphere. Bad (“b”) signals are those that pass through the ionosphere.

Create a table from the .mat file:

load ionosphere
ionosphere = array2table(X);
ionosphere.Group = Y;

Choose Validation Scheme

Choose a validation method to examine the predictive accuracy of the fitted models. Validation estimates how each model performs on new data (as opposed to the data used for training), helps you choose the best model, and protects against overfitting. Choose a validation scheme before training any models, so that you can compare all the models in your session using the same validation scheme.

Tip

Try the default validation scheme and click Start Session to continue. The default option is 5-fold cross-validation, which protects against overfitting.

If you have a large data set and training models takes too long using cross-validation, reimport your data and try the faster holdout validation instead.

The following validation options assume that no data is reserved for testing, which is true by default.

  • Cross-Validation: Select a number of folds (or divisions) to partition the data set.

    If you choose k folds, then the app:

    1. Partitions the data into k disjoint sets or folds

    2. For each validation fold:

      1. Trains a model using the training-fold observations (observations not in the validation fold)

      2. Assesses model performance using validation-fold data

    3. Calculates the average validation error over all folds

    This method gives a good estimate of the predictive accuracy of the final model trained with all the data. It requires multiple fits but makes efficient use of all the data, so it is recommended for small data sets. For a command-line illustration of this option and of holdout validation, see the sketch after this list.

  • Holdout Validation: Select a percentage of the data to use as a validation set. The app trains a model on the training set and assesses its performance with the validation set. The model used for validation is based on only a portion of the data, so Holdout Validation is recommended only for large data sets. The final model is trained with the full data set.

  • Resubstitution Validation: No protection against overfitting. The app uses all of the data for training and computes the error rate on the same data. Without any separate validation data, you get an unrealistic estimate of the model’s performance on new data. That is, the training sample accuracy is likely to be unrealistically high, and the predictive accuracy is likely to be lower.

    To help you avoid overfitting to the training data, choose another validation scheme instead.
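The following sketch mirrors the cross-validation and holdout options at the command line, using a decision tree (fitctree) as an example learner. It is only an analogy for how the app reports validation metrics, not the app's internal implementation.

load fisheriris
rng("default")                                 % for reproducible partitions

% 5-fold cross-validation: average misclassification rate over the folds
tree = fitctree(meas, species);                % model trained on all the data
cvtree = crossval(tree, "KFold", 5);           % cross-validated copy of the model
kfoldError = kfoldLoss(cvtree)                 % average validation error over the 5 folds

% Holdout validation: train on 75% of the data, validate on the held-out 25%
c = cvpartition(species, "Holdout", 0.25);
holdoutTree = fitctree(meas(training(c),:), species(training(c)));
holdoutError = loss(holdoutTree, meas(test(c),:), species(test(c)))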

Note

The validation scheme only affects the way that Classification Learner computes validation metrics. The final model is always trained using the full data set, excluding any data reserved for testing.

All the classification models you train after selecting data use the same validation scheme that you select in this dialog box, so you can compare all the models in your session on an equal footing.

To change the validation selection and train new models, you can select data again, but you lose any trained models. The app warns you that importing data starts a new session. Save any trained models you want to keep to the workspace, and then import the data.

For next steps in training models, see Train Classification Models in Classification Learner App.

(Optional) Reserve Data for Testing

When you import data into Classification Learner, you can specify to reserve a percentage of the data for testing. In the Test section of the New Session dialog box, click the check box to set aside a test data set. Specify the percentage of the imported data to use as a test set. If you prefer, you can still choose to import a separate test data set after starting an app session.

You can use the test set to evaluate the performance of a trained model. In particular, you can check whether the validation metrics provide good estimates for the model performance on new data. For more information, see Evaluate Test Set Model Performance. For an example, see Train Classifier Using Hyperparameter Optimization in Classification Learner App.
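As a rough command-line analogy (not the mechanism the app uses internally), the following sketch reserves 15% of the fisheriris data for testing with cvpartition, trains a decision tree on the remainder, and then evaluates the test error:

load fisheriris
rng("default")
c = cvpartition(species, "Holdout", 0.15);            % reserve 15% of the data for testing
XTrain = meas(training(c),:);    YTrain = species(training(c));
XTest  = meas(test(c),:);        YTest  = species(test(c));

mdl = fitctree(XTrain, YTrain);                       % train on the remaining 85%
testError = loss(mdl, XTest, YTest)                   % misclassification rate on the test set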

Note

The app does not use test data for model training. Models exported from the app are trained on the full training and validation data, excluding any data reserved for testing.

Save and Open App Session

In Classification Learner, you can save the current app session and open a previously saved app session.

  • To save the current app session, click Save in the File section of the Learn tab. When you first save the current session, you must specify the session file name and the file location. The Save Session option saves the current session, and the Save Session As option saves the current session to a new file. The Save Compact Session As option saves a compact version of the current app session, resulting in a smaller file size for the saved session. Note that the Save Compact Session As option permanently deletes the training data from all trained models in the current session.

  • To open a saved app session, click Open in the File section. In the Select File to Open dialog box, select the saved session you want to open.
