Prepare Data for Time Series Anomaly Detector Training and Validation

A major benefit of the time series anomaly detection process is that you need only normal data to train the detector, and normal data is usually abundant. For testing the detector, you need representative sets of anomalous data to determine whether the detector can detect the anomalies. This topic discusses the organization required for this data.

Accepted Data Formats

The app and the command-line functions both accept data as a matrix, a timetable, a cell array of matrices, or a cell array of timetables. For the app, these data sets can also include labels that provide ground truth, but this is not a requirement. The following sections describe the requirements for each data type.

Matrix

Unlabeled

Data should be purely numeric, with the rows ordered in time. Each column of the matrix represents an additional channel of data. When you train a model, the software uses the entire matrix and considers the data set to have only one member.
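As a minimal sketch of this layout, the following builds a numeric matrix with time-ordered rows and one column per channel, then checks its shape. The signal definitions and the variable names (X, numSamples, numChannels) are hypothetical and chosen only for illustration.

```matlab
% Hypothetical example: 1000 time-ordered samples of 3 channels.
numSamples = 1000;
t = (0:numSamples-1)';                          % sample index, increasing in time

% Rows are time steps; each column is one channel.
X = [sin(2*pi*t/50), cos(2*pi*t/50), 0.01*t];

% Basic sanity checks on the layout described above.
assert(isnumeric(X) && ismatrix(X), "Data must be a numeric matrix.");
[numRows, numChannels] = size(X);
fprintf("%d samples, %d channels, 1 member\n", numRows, numChannels);
```

Because the whole matrix counts as one member, recordings from separate runs are better kept in separate matrices than stacked row-wise into one.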

Labeled

  • The only command-line function that can use data labels is timeSeriesAnomalyMetrics. For this function, you must provide the labels as a separate vector.

  • For the app, you can include the labels in the matrix that you import into the app.

    To include labels in a matrix data set, make the last column of the matrix a sample-wise label variable. Multiclass labels are supported, so the values do not need to be restricted to 0 and 1. When the app trains a model, it uses all columns except the last column. The app extracts the last column from the matrix on import and uses it only where it displays ground truth labels.
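The two labeling conventions for matrices can be sketched as follows: a matrix with the label column last (the layout for the app) and the same data split into a channel matrix plus a separate label vector (the layout for command-line use). All variable names and the position of the anomalous segment are hypothetical.

```matlab
% Hypothetical data: 200 samples of 2 channels with one labeled anomaly.
numSamples = 200;
signals = randn(numSamples, 2);

labels = zeros(numSamples, 1);        % 0 = normal
labels(120:130) = 1;                  % 1 = anomalous (illustrative segment)

% Layout for the app: label variable as the last column.
dataWithLabels = [signals, labels];

% Layout for command-line use: channels and labels kept separate.
channelData = dataWithLabels(:, 1:end-1);
labelVector = dataWithLabels(:, end);
```

Here labelVector is what you would pass as the separate label vector; the exact call syntax of timeSeriesAnomalyMetrics is not shown because this topic does not specify it.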

Timetable

Unlabeled

Data should be in a uniformly sampled timetable, with timestamps ordered and increasing. For both the command-line functions and the app, the software considers the entire timetable to be a single member. Each variable in the timetable represents a channel of data. All data channels are used for model training.
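One way to build such a timetable and confirm that it is uniformly sampled is sketched below. The sample rate, the variable names Pressure and Temperature, and the random data are assumptions made for illustration only.

```matlab
% Hypothetical recording: 500 samples of 2 channels at 10 Hz.
fs = 10;
numSamples = 500;
X = randn(numSamples, 2);

% Specifying a sample rate produces uniformly spaced row times.
tt = array2timetable(X, "SampleRate", fs, ...
    "VariableNames", ["Pressure", "Temperature"]);

% Check uniform sampling and strictly increasing timestamps.
rowTimes = tt.Properties.RowTimes;
assert(isregular(tt), "Timetable must be uniformly sampled.");
assert(all(diff(rowTimes) > seconds(0)), "Timestamps must be increasing.");
```

If isregular returns false for recorded data, resampling onto a regular time vector (for example with retime) before training is one option.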

Labeled

  • The only command-line function that can use data labels is timeSeriesAnomalyMetrics. For this function, you must provide the labels as a separate vector.

  • For the app, you can define any variable of the timetable as the label variable when you import the timetable into the app.

    Labels can be multiclass and can be numeric, logical, string, or categorical. Because timetables use variable names, the label variable does not need to be the last variable. When the app trains a model, it uses all variables except the label variable. The app uses the label variable only for app metrics that use ground truth.
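The following sketch shows a labeled timetable in which a categorical label variable sits between two data channels, and then separates the labels for command-line use. The variable names (Vibration, Status, Current), the class names, and the sample rate are hypothetical.

```matlab
% Hypothetical labeled recording: 300 samples at 100 Hz.
numSamples = 300;
vibration = randn(numSamples, 1);
current = randn(numSamples, 1);

% Categorical labels; the "fault" segment is illustrative only.
status = repmat("normal", numSamples, 1);
status(200:220) = "fault";
status = categorical(status);

% The label variable does not have to be last.
tt = timetable(vibration, status, current, "SampleRate", 100, ...
    "VariableNames", ["Vibration", "Status", "Current"]);

% For command-line use, separate the labels from the channels.
labelName = "Status";
labelVector = tt.(labelName);
channelData = removevars(tt, labelName);
```

For the app, you would import tt as is and select Status as the label variable during import.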

Cell Array of Matrices

Each cell must contain a numeric matrix, formatted as described in Matrix. The first cell sets the required number of columns, and each subsequent cell must have at least the same number of columns as the first cell. The software ignores any columns beyond that number.

The cell array can be 1-by-m or m-by-1, where m is the number of members.
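A short sketch of a multi-member data set follows, with a check that every member has at least as many columns as the first cell. The member sizes are hypothetical, and the trimming step is optional: it only makes explicit which columns would be used.

```matlab
% Hypothetical data set: 3 members with different lengths.
% The third member has 2 extra columns beyond the required 3.
members = {randn(400, 3), randn(250, 3), randn(600, 5)};   % 1-by-m

% The first cell sets the required number of columns.
numRequired = size(members{1}, 2);

% Every cell must be a numeric matrix with at least that many columns.
isValid = cellfun(@(m) isnumeric(m) && ismatrix(m) ...
    && size(m, 2) >= numRequired, members);
assert(all(isValid), "A member has too few columns or is not numeric.");

% Optional: drop the extra columns yourself so nothing is ignored silently.
trimmed = cellfun(@(m) m(:, 1:numRequired), members, "UniformOutput", false);
```

Members can differ in the number of rows, as in this example, because each cell is treated as its own time series.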

Cell Array of Timetables

Each cell must contain a timetable, formatted as defined in Timetable. The first cell sets the required variables of the timetable, and each subsequent cell must have at least the same variables. The software ignores any additional variables.

The cell array can be 1-by-m or m-by-1, where m is the number of members.
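The same check can be sketched for timetables, assuming that variables are matched by name. The helper makeMember, the variable names, and the sample rate are hypothetical and exist only to generate example data.

```matlab
% Hypothetical helper that builds a uniformly sampled timetable.
makeMember = @(n, names) array2timetable(randn(n, numel(names)), ...
    "SampleRate", 10, "VariableNames", names);

% Two members in an m-by-1 cell array; the second has an extra variable.
members = {makeMember(300, ["Flow", "Pressure"]); ...
           makeMember(450, ["Flow", "Pressure", "Ambient"])};

% The first cell sets the required variables.
requiredVars = members{1}.Properties.VariableNames;

% Every member must contain at least those variables.
hasAll = cellfun(@(tt) all(ismember(requiredVars, ...
    tt.Properties.VariableNames)), members);
assert(all(hasAll), "A member is missing a required variable.");

% Optional: keep only the required variables in every member.
aligned = cellfun(@(tt) tt(:, requiredVars), members, "UniformOutput", false);
```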