incrementalRobustRandomCutForest

Robust random cut forest model for incremental anomaly detection

Since R2023b

    Description

    The incrementalRobustRandomCutForest function creates an incrementalRobustRandomCutForest model object, which represents a robust random cut forest (RRCF) model for incremental anomaly detection.

    Unlike other Statistics and Machine Learning Toolbox™ model objects, incrementalRobustRandomCutForest can be called directly. Also, you can specify learning options, such as the number of robust random cut trees, the contamination fraction in the training data, and whether to standardize the predictor data before fitting the model to data. After you create an incrementalRobustRandomCutForest object, it is prepared for incremental learning (see Incremental Learning for Anomaly Detection).

    incrementalRobustRandomCutForest is best suited for incremental learning. For a traditional approach to anomaly detection when all the data is provided in advance, see rrcforest.

    Creation

    You can create an incrementalRobustRandomCutForest model object in several ways:

    • Call the function directly — Configure incremental learning options, or specify learner-specific options, by calling incrementalRobustRandomCutForest directly. This approach is best when you do not have data yet or you want to start incremental learning immediately.

    • Convert a traditionally trained model — To initialize an RRCF model for incremental learning using the model parameters and hyperparameters of a trained model object, you can convert the traditionally trained model to an incrementalRobustRandomCutForest model object by passing it to the incrementalLearner function.

    • Call an incremental learning function: fit accepts a configured incrementalRobustRandomCutForest model object and data as input, and returns an incrementalRobustRandomCutForest model object updated with information learned from the input model and data.
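
    The three creation routes listed above can be sketched as follows. This is a minimal illustration rather than a recommended workflow: the synthetic matrix X and all variable names are assumptions made for the example, and the conversion step relies on rrcforest and incrementalLearner behaving as described in the list.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(500,3);                % synthetic predictor data (assumption)

% Route 1: call the function directly; no data is required yet.
forestDirect = incrementalRobustRandomCutForest(ContaminationFraction=0.01);

% Route 2: train a traditional RRCF model, then convert it.
trained = rrcforest(X,ContaminationFraction=0.01);
forestConverted = incrementalLearner(trained);

% Route 3: fit takes a configured model plus data and returns an
% updated model object of the same class.
forestFitted = fit(forestDirect,X);
```

    In each case the result is an incrementalRobustRandomCutForest object, so the same object functions apply regardless of how the model was created.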

    Description

    forest = incrementalRobustRandomCutForest returns an incremental RRCF model object forest for anomaly detection with default parameters. Properties of a default model contain placeholders for unknown model parameters. You must train a default model before you can use it to detect anomalies.

    forest = incrementalRobustRandomCutForest(Name,Value) sets properties and additional options using one or more name-value arguments. For example, incrementalRobustRandomCutForest(ContaminationFraction=0.1,ScoreWarmupPeriod=1000) sets the anomaly contamination fraction to 0.1 and the score warm-up period to 1000.

    Input Arguments

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: incrementalRobustRandomCutForest(StandardizeData=true) specifies to standardize the predictor data.

    StandardizeData

    Flag to standardize the predictor data, specified as a numeric or logical 1 (true) or 0 (false).

    If you set StandardizeData=true, the incrementalRobustRandomCutForest function centers and scales each predictor variable (X or Tbl) by the corresponding column mean and standard deviation. The function does not standardize the data contained in the dummy variable columns generated for categorical predictors.

    Example: StandardizeData=true

    Data Types: logical

    Options

    Options for computing in parallel and setting random streams, specified as a structure. Create the Options structure using statset. The option fields, their values, and their defaults are:

    • UseParallel: Set this value to true to run computations in parallel. The default is false.

    • UseSubstreams: Set this value to true to run computations in a reproducible manner. To compute reproducibly, set Streams to a type that allows substreams: "mlfg6331_64" or "mrg32k3a". The default is false.

    • Streams: Specify this value as a RandStream object or cell array of such objects. Use a single object except when the UseParallel value is true and the UseSubstreams value is false. In that case, use a cell array that has the same size as the parallel pool. If you do not specify Streams, then incrementalRobustRandomCutForest uses the default stream or streams.

    Note

    You need Parallel Computing Toolbox™ to run computations in parallel.

    Example: Options=statset(UseParallel=true,UseSubstreams=true,Streams=RandStream("mlfg6331_64"))

    Data Types: struct
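
    As a small sketch of the Options argument, the following builds the structure with statset and passes it at creation. UseParallel is left at false here so that Parallel Computing Toolbox is not needed; the variable names stream and opts are illustrative assumptions.

```matlab
% Create a random stream of a type that supports substreams.
stream = RandStream("mlfg6331_64");

% Build the options structure. UseParallel stays false in this sketch.
opts = statset(UseParallel=false,UseSubstreams=true,Streams=stream);

% Pass the structure when creating the incremental model.
forest = incrementalRobustRandomCutForest(Options=opts);
```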

    ScoreWarmupPeriod

    Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This option specifies the number of observations used by the incremental fit function to train the model and estimate the score threshold.

    Note

    When processing observations during the score warm-up period, the software ignores observations that contain missing values for all predictors.

    Example: ScoreWarmupPeriod=200

    Data Types: single | double

    ScoreWindowSize

    Running window size used to estimate the score threshold (ScoreThreshold), specified as a positive integer. The default ScoreWindowSize value is 1000.

    If ScoreWindowSize is greater than the number of observations in the training data, the software determines ScoreThreshold by subsampling from the training data. Otherwise, ScoreThreshold is set to forest.ScoreThreshold.

    Example: ScoreWindowSize=100

    Data Types: single | double

    Properties

    You can set most properties by using name-value argument syntax when you call incrementalRobustRandomCutForest directly. You can set some properties when you call incrementalLearner to convert a traditionally trained model object. You cannot set the properties Mu, NumTrainingObservations, ScoreThreshold, Sigma, and IsWarm.

    CategoricalPredictors

    This property is read-only.

    List of categorical predictors, specified as one of the following values.

    • Vector of positive integers: Each entry in the vector is an index value indicating that the corresponding predictor is categorical. The index values are between 1 and p, where p is the number of predictors used to train the model.

      If incrementalRobustRandomCutForest uses a subset of input variables as predictors, then the function indexes the predictors using only the subset. The CategoricalPredictors values do not count any variables that the function does not use.

    • Logical vector: A true entry means that the corresponding predictor is categorical. The length of the vector is p.

    • Character matrix: Each row of the matrix is the name of a predictor variable. The names must match the entries in PredictorNames. Pad the names with extra blanks so each row of the character matrix has the same length.

    • String array or cell array of character vectors: Each element in the array is the name of a predictor variable. The names must match the entries in PredictorNames.

    • "all": All predictors are categorical.

    Data Types: single | double | logical | char | string | cell

    CollusiveDisplacement

    This property is read-only.

    Collusive displacement calculation method, specified as "maximal" or "average".

    The incrementalRobustRandomCutForest function finds the maximum change ("maximal") or the average change ("average") in model complexity for each tree, and computes the collusive displacement (anomaly score) for each observation.

    Data Types: char | string
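
    To compare the two calculation methods, one option is to fit two otherwise identical models to the same data and inspect their scores. The sketch below assumes that CollusiveDisplacement can be set by name-value argument at creation; the synthetic data and variable names are illustrative only.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(400,2);                % synthetic predictor data (assumption)

% Same data, two collusive displacement methods.
forestMax = incrementalRobustRandomCutForest(CollusiveDisplacement="maximal");
forestAvg = incrementalRobustRandomCutForest(CollusiveDisplacement="average");

forestMax = fit(forestMax,X);
forestAvg = fit(forestAvg,X);

% Score the same observations with each model.
[~,scoresMax] = isanomaly(forestMax,X);
[~,scoresAvg] = isanomaly(forestAvg,X);
```

    Plotting scoresMax against scoresAvg is one way to see how the choice of method affects the anomaly scores for a given data set.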

    ContaminationFraction

    This property is read-only.

    Fraction of anomalies in the training data, specified as a numeric scalar in the range [0,1].

    • If the ContaminationFraction value is 0, then incrementalRobustRandomCutForest treats all training observations as normal observations, and sets the ScoreThreshold value to the maximum anomaly score value of the training data.

    • If the ContaminationFraction value is in the range (0,1], then incrementalRobustRandomCutForest determines the ScoreThreshold value so that the function detects the specified fraction of training observations as anomalies.

    The default ContaminationFraction value depends on how you create the model:

    • If you convert a traditionally trained model to create forest, then ContaminationFraction is specified by the corresponding property of the traditionally trained model.

    • If you create forest by calling incrementalRobustRandomCutForest directly, then you can specify ContaminationFraction by using name-value argument syntax. If you do not specify the value, then the default value is 0.

    Data Types: single | double
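
    The relationship between ContaminationFraction and ScoreThreshold can be pictured as a quantile rule over the training scores. The following is a conceptual sketch only, not the toolbox's internal computation: the score vector is hypothetical, and the use of quantile is an assumption made for illustration.

```matlab
scores = (1:100)';               % hypothetical anomaly scores
c = 0.1;                         % illustrative contamination fraction

if c == 0
    % No contamination: nothing exceeds the maximum training score.
    threshold = max(scores);
else
    % Choose the threshold so that roughly a fraction c of scores exceed it.
    threshold = quantile(scores,1-c);
end

flagged = scores > threshold;    % observations treated as anomalies
fractionFlagged = mean(flagged);
```

    With c set to 0, the threshold equals the largest score, so no training observation is flagged, which matches the first bullet above.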

    EstimationPeriod

    This property is read-only.

    Number of observations processed by the incremental learner to estimate hyperparameters before training, specified as a nonnegative integer.

    • When processing observations during the estimation period, the software ignores observations that have missing values for all predictors.

    • If you specify a positive EstimationPeriod and StandardizeData is false, incrementalRobustRandomCutForest forces EstimationPeriod to 0.

    • If forest is prepared for incremental learning (all hyperparameters required for training are specified), incrementalRobustRandomCutForest forces EstimationPeriod to 0.

    • If forest is not prepared for incremental learning and StandardizeData is true, incrementalRobustRandomCutForest sets EstimationPeriod to 1000 and estimates the unknown hyperparameters.

    For more details, see Estimation Period.

    Data Types: single | double

    IsWarm

    This property is read-only.

    Flag indicating whether the incremental fitting function fit returns scores and detects anomalies after training the model, specified as a numeric or logical 0 (false) or 1 (true).

    The incremental model forest is warm (IsWarm becomes true) after the fit function fits the incremental model to ScoreWarmupPeriod observations.

    You cannot specify IsWarm directly.

    Data Types: logical
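
    The following sketch shows how IsWarm might be observed changing as observations accumulate. It uses synthetic data and a warm-up period of 200 observations; both are assumptions chosen for the example.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(300,2);                % synthetic predictor data (assumption)

forest = incrementalRobustRandomCutForest(ScoreWarmupPeriod=200);

% After 100 observations the model is still inside the warm-up period.
forest = fit(forest,X(1:100,:));
warmAfter100 = forest.IsWarm;

% After 300 observations the warm-up period has been exceeded.
forest = fit(forest,X(101:300,:));
warmAfter300 = forest.IsWarm;
```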

    Mu

    This property is read-only.

    Predictor means of the training data, specified as a numeric vector.

    • If you specify StandardizeData=true:

      • The fit function does not standardize columns that contain categorical variables. The elements in Mu for categorical variables contain NaN values.

      • The isanomaly function standardizes the input data by using the predictor means in Mu and standard deviations in Sigma.

      The length of Mu is equal to the number of predictors.

    • If you set StandardizeData=false, then Mu is an empty vector ([]).

    You cannot specify Mu directly.

    Data Types: single | double
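
    As an illustration of how Mu (and the matching Sigma property) are populated, the sketch below standardizes synthetic numeric data with a short estimation period. The data, the chunk size, and the EstimationPeriod value of 200 are assumptions for the example.

```matlab
rng(0,"twister")                 % for reproducibility
X = 5 + 2*randn(600,3);          % synthetic numeric predictors (assumption)

forest = incrementalRobustRandomCutForest(StandardizeData=true, ...
    EstimationPeriod=200);

% Stream the data in chunks of 200 observations. The first chunk covers
% the estimation period; later chunks train the model.
for j = 1:3
    idx = (j-1)*200 + (1:200);
    forest = fit(forest,X(idx,:));
end

mu = forest.Mu;                  % one element per predictor
sigma = forest.Sigma;            % one element per predictor
```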

    NumLearners

    This property is read-only.

    Number of robust random cut trees (trees in the RRCF model), specified as a positive integer scalar.

    Data Types: single | double

    NumObservationsPerLearner

    This property is read-only.

    Number of observations to draw from the training data without replacement for each robust random cut tree (tree in the RRCF model), specified as a positive integer scalar greater than or equal to 3.

    Data Types: single | double

    NumObservationsToKeep

    This property is read-only.

    Size of historical data that pertains to the RRCF model's knowledge, specified as a positive integer scalar.

    Data Types: single | double

    NumPredictors

    This property is read-only.

    Number of predictor variables, specified as a nonnegative numeric scalar.

    The default NumPredictors value depends on how you create the model:

    • If you convert a traditionally trained model to create forest, NumPredictors is specified by the corresponding property of the traditionally trained model.

    • If you create forest by calling incrementalRobustRandomCutForest directly, you can specify NumPredictors by using name-value argument syntax. If you do not specify the value, then the default value is 0, and incremental fitting functions infer NumPredictors from the predictor data during training.

    Data Types: double

    NumTrainingObservations

    This property is read-only.

    Number of observations fit to the incremental model forest, specified as a nonnegative numeric scalar. NumTrainingObservations increases when you pass forest and training data to fit outside of the estimation period.

    • When fitting the model, the software ignores observations that have missing values for all predictors.

    • If you convert a traditionally trained model to create forest, incrementalRobustRandomCutForest does not add the number of observations fit to the traditionally trained model to NumTrainingObservations.

    You cannot specify NumTrainingObservations directly.

    Data Types: double

    ObservationRemoval

    Observation removal method, specified as "oldest", "timedecaying", or "random". When the robust random cut trees reach their capacity, the software removes old observations to accommodate the most recent data.

    • "oldest": Oldest observations are removed first.

    • "timedecaying": Observations are removed randomly in a weighted fashion. Older observations have a higher probability of being removed first.

    • "random": Observations are removed in random order.

    Data Types: string | char
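
    A brief sketch of selecting a removal method follows. It assumes ObservationRemoval can be supplied as a name-value argument at creation, and it streams more synthetic observations than the default per-tree sample size shown in the Examples section (256) so that removal is likely to occur.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(600,2);                % synthetic predictor data (assumption)

% Prefer removing older observations, with some randomness.
forest = incrementalRobustRandomCutForest(ObservationRemoval="timedecaying");

% Stream the data in chunks of 100 observations.
for j = 1:6
    idx = (j-1)*100 + (1:100);
    forest = fit(forest,X(idx,:));
end
```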

    PredictorNames

    This property is read-only.

    Predictor variable names, specified as a string array of unique names or cell array of unique character vectors. The functionality of PredictorNames depends on how you supply the predictor data.

    • If you supply Tbl, then you can use PredictorNames to specify which predictor variables to use. That is, incrementalRobustRandomCutForest uses only the predictor variables in PredictorNames.

      • PredictorNames must be a subset of Tbl.Properties.VariableNames.

      • By default, PredictorNames contains the names of all predictor variables in Tbl.

    • If you supply X, then you can use PredictorNames to assign names to the predictor variables in X.

      • The order of the names in PredictorNames must correspond to the column order of X. That is, PredictorNames{1} is the name of X(:,1), PredictorNames{2} is the name of X(:,2), and so on. Also, size(X,2) and numel(PredictorNames) must be equal.

      • By default, PredictorNames is {"x1","x2",...}.

    Data Types: string | cell

    ScoreThreshold

    This property is read-only.

    Threshold for the anomaly score used to detect anomalies, specified as a nonnegative scalar. incrementalRobustRandomCutForest detects observations with scores above the threshold as anomalies.

    The default ScoreThreshold value depends on how you create the model:

    • If you convert a traditionally trained model object to create forest, then ScoreThreshold is specified by the corresponding property value of the object.

    • Otherwise, the default value is 0.

    ScoreThreshold has the value 0 until the number of observations reaches the ScoreWarmupPeriod value. After that, the software updates the ScoreThreshold with every new observation.

    You cannot specify ScoreThreshold directly.

    Data Types: single | double

    ScoreWarmupPeriod

    This property is read-only.

    Warm-up period before score computation and anomaly detection, specified as a nonnegative integer. This value is the number of observations used by the incremental fit function to train the model and estimate the score threshold.

    • When processing observations during the score warm-up period, the software ignores observations that have missing values for all predictors.

    • You can return scores and detect anomalies during the warm-up period by calling isanomaly directly.

    The default ScoreWarmupPeriod value depends on how you create the model:

    • If you convert a traditionally trained model to create forest, the ScoreWarmupPeriod name-value argument of the incrementalLearner function sets this property.

    • Otherwise, the default value is 0.

    Data Types: single | double
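
    The difference between calling fit and calling isanomaly during the warm-up period can be sketched as follows. The synthetic data and the warm-up value of 500 are assumptions; the expectation that fit returns NaN scores and false flags during warm-up follows the behavior described in the Examples section of this page.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(100,2);                % synthetic predictor data (assumption)

forest = incrementalRobustRandomCutForest(ScoreWarmupPeriod=500);

% Only 100 observations are fitted, so the model is still warming up.
% fit trains the model but does not yet report usable scores.
[forest,tfFit,scoresFit] = fit(forest,X);

% isanomaly can be called directly to score data during the warm-up period.
[tfDirect,scoresDirect] = isanomaly(forest,X);
```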

    ScoreWindowSize

    This property is read-only.

    Running window size for ScoreThreshold estimation, specified as a positive integer. The software estimates the ScoreThreshold value over a running window with a window size of ScoreWindowSize.

    The default ScoreWindowSize value depends on how you create the model:

    • If you convert a traditionally trained model to create forest, the ScoreWindowSize name-value argument of the incrementalLearner function sets this property.

    • Otherwise, the default value is 1000.

    Data Types: double

    Sigma

    This property is read-only.

    Predictor standard deviations of the training data, specified as a numeric vector.

    • If you specify StandardizeData=true when you train an incremental RRCF model using fit:

      • The fit function does not standardize columns that contain categorical variables. The elements in Sigma for categorical variables contain NaN values.

      • The isanomaly function standardizes the input data by using the predictor means in Mu and standard deviations in Sigma.

      The length of Sigma is equal to the number of predictors.

    • If you set StandardizeData=false, then Sigma is an empty vector ([]).

    You cannot specify Sigma directly.

    Object Functions

    fit: Train robust random cut forest model for incremental anomaly detection
    isanomaly: Find anomalies in data using robust random cut forest (RRCF) for incremental learning
    reset: Reset incremental robust random cut forest model
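
    The three object functions can be exercised together in a short sketch. The data is synthetic, the variable names are illustrative, and the sketch assumes that reset clears the count of training observations.

```matlab
rng(0,"twister")                 % for reproducibility
X = randn(200,2);                % synthetic predictor data (assumption)

forest = incrementalRobustRandomCutForest;

% fit: train the model on a chunk of data.
forest = fit(forest,X);
nBefore = forest.NumTrainingObservations;

% isanomaly: flag and score observations with the trained model.
[tf,scores] = isanomaly(forest,X);

% reset: discard what the model has learned so far.
forest = reset(forest);
nAfter = forest.NumTrainingObservations;
```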

    Examples

    Create a default robust random cut forest model for incremental anomaly detection.

    forest = incrementalRobustRandomCutForest;
    details(forest)
      incrementalRobustRandomCutForest with properties:
    
            CollusiveDisplacement: 'maximal'
                      NumLearners: 100
        NumObservationsPerLearner: 256
               ObservationRemoval: 'oldest'
            NumObservationsToKeep: 256
                               Mu: []
                            Sigma: []
            CategoricalPredictors: []
                 EstimationPeriod: 0
                           IsWarm: 0
            ContaminationFraction: 0
          NumTrainingObservations: 0
                    NumPredictors: 0
                   ScoreThreshold: 0
                ScoreWarmupPeriod: 0
                   PredictorNames: {}
                  ScoreWindowSize: 1000
    

    forest is an incrementalRobustRandomCutForest model object. All its properties are read-only. By default, the software sets the anomaly contamination fraction to 0 and the score warm-up period to 0. forest must be fit to data before you can use it to perform any other operations.

    Load Data

    Load the human activity data set and keep only the first 3000 observations. For details on the data set, enter Description at the command line.

    load humanactivity.mat
    feat = feat(1:3000,:);

    Fit Incremental Model and Detect Anomalies

    Fit the incremental model forest to the data by using the fit function. Because ScoreWarmupPeriod = 0, fit returns scores and detects anomalies immediately after fitting the model for the first time. To simulate a data stream, fit the model in chunks of 100 observations at a time. At each iteration:

    • Process 100 observations.

    • Overwrite the previous incremental model with a new one fitted to the incoming observations.

    • Store medianscore, the median score value of the data chunk, to see how it evolves during incremental learning.

    • Store allscores, the score values for the fitted observations.

    • Store threshold, the score threshold value for anomalies, to see how it evolves during incremental learning.

    • Store numAnom, the number of detected anomalies in the data chunk.

    n = numel(feat(:,1));
    numObsPerChunk = 100;
    nchunk = floor(n/numObsPerChunk);
    medianscore = zeros(nchunk,1);
    threshold = zeros(nchunk,1);    
    numAnom = zeros(nchunk,1);
    allscores = [];
    
    % Incremental fitting
    rng(0,"twister"); % For reproducibility
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        idx = ibegin:iend;    
        forest = fit(forest,feat(idx,:));
        [isanom,scores] = isanomaly(forest,feat(idx,:));
        medianscore(j) = median(scores);
        allscores = [allscores scores'];    
        numAnom(j) = sum(isanom);
        threshold(j) = forest.ScoreThreshold;
    end

    forest is an incrementalRobustRandomCutForest model object trained on all the data in the stream. The fit function fits the model to the data chunk, and the isanomaly function returns the observation scores and the indices of observations in the data chunk with scores above the score threshold value.

    Analyze Incremental Model During Training

    Plot the anomaly score for every observation.

    plot(allscores,".-")
    xlabel("Observation")
    ylabel("Score")

    [Figure: line plot of the anomaly score (y-axis: Score) for each observation (x-axis: Observation).]

    At each iteration, the software calculates a score value for each observation in the data chunk. A low score value indicates a normal observation, and a high score value indicates an anomaly.

    To see how the score threshold and median score per data chunk evolve during training, plot them on separate tiles.

    figure
    tiledlayout(2,1);
    nexttile
    plot(medianscore,".-")
    ylabel("Median Score")
    xlabel("Iteration")
    xlim([0 nchunk])
    nexttile
    plot(threshold,".-")
    ylabel("Score Threshold")
    xlabel("Iteration")
    xlim([0 nchunk])

    [Figure: two tiled line plots against Iteration. Top: Median Score. Bottom: Score Threshold.]

    finalScoreThreshold = forest.ScoreThreshold
    finalScoreThreshold = 
    93.7052
    

    The median score fluctuates between 4 and 20. The anomaly score threshold has a value of 20 after the first iteration and steadily approaches a value of 94 by the 22nd iteration. Because ContaminationFraction = 0, incrementalRobustRandomCutForest treats all training observations as normal observations, and at each iteration sets the score threshold to the maximum score value in the data chunk.

    totalAnomalies = sum(numAnom)
    totalAnomalies = 
    0
    

    No anomalies are detected at any iteration, because ContaminationFraction = 0.

    Prepare an incremental robust random cut forest model by specifying an anomaly contamination fraction of 0.001, and standardize the data using an initial estimation period of 500 observations. Specify a score warm-up period of 1000 observations, during which the fit function updates the score threshold and trains the model but does not return scores or identify anomalies.

    forest = incrementalRobustRandomCutForest(ContaminationFraction=0.001, ...
        StandardizeData=true,ScoreWarmupPeriod=1000,EstimationPeriod=500);

    forest is an incrementalRobustRandomCutForest model object. All its properties are read-only. forest must be fit to data before you can use it to perform any other operations.

    Load Data

    Load the credit rating data stored in CreditRating_Historical.dat. Remove the ID column and the categorical variables.

    creditrating = readtable("CreditRating_Historical.dat");
    creditrating = removevars(creditrating,["ID","Industry","Rating"]);

    The fit function of incrementalRobustRandomCutForest does not use observations with missing values. Remove missing values in the data sets to reduce memory consumption and speed up training.

    creditrating = rmmissing(creditrating);

    Fit Incremental Model and Detect Anomalies

    Fit the incremental model forest to the data by using the fit function. To simulate a data stream, fit the model in chunks of 100 observations at a time. Because EstimationPeriod = 500 and ScoreWarmupPeriod = 1000, fit returns scores and detects anomalies only after 15 iterations. At each iteration:

    • Process 100 observations.

    • Overwrite the previous incremental model with a new one fitted to the incoming observations.

    • Store meanscore, the mean score value of the data chunk, to see how it evolves during incremental learning.

    • Store threshold, the score threshold value for anomalies, to see how it evolves during incremental learning.

    • Store numAnom, the number of detected anomalies in the chunk, to see how it evolves during incremental learning.

    n = numel(creditrating(:,1));
    numObsPerChunk = 100;
    nchunk = floor(n/numObsPerChunk);
    meanscore = zeros(nchunk,1);
    threshold = zeros(nchunk,1);    
    numAnom = zeros(nchunk,1);
    
    % Incremental fitting
    rng(0,"twister"); % For reproducibility
    for j = 1:nchunk
        ibegin = min(n,numObsPerChunk*(j-1) + 1);
        iend = min(n,numObsPerChunk*j);
        idx = ibegin:iend;    
        [forest,tf,scores] = fit(forest,creditrating(idx,:));
        meanscore(j) = mean(scores);
        numAnom(j) = sum(tf);
        threshold(j) = forest.ScoreThreshold;
    end

    forest is an incrementalRobustRandomCutForest model object trained on all the data in the stream.

    Analyze Incremental Model During Training

    To see how the mean score, score threshold and number of detected anomalies per chunk evolve during training, plot them on separate tiles.

    tiledlayout(3,1);
    nexttile
    plot(meanscore)
    ylabel("Mean Score")
    xlabel("Iteration")
    xlim([0 nchunk])
    xline(forest.EstimationPeriod/numObsPerChunk,"r-.")
    xline((forest.EstimationPeriod+forest.ScoreWarmupPeriod)/numObsPerChunk,"r")
    nexttile
    plot(threshold)
    ylabel("Score Threshold")
    xlabel("Iteration")
    xlim([0 nchunk])
    xline(forest.EstimationPeriod/numObsPerChunk,"r-.")
    xline((forest.EstimationPeriod+forest.ScoreWarmupPeriod)/numObsPerChunk,"r")
    nexttile
    plot(numAnom,"+")
    ylabel("Anomalies")
    xlabel("Iteration")
    xlim([0 nchunk])
    ylim([0 max(numAnom)+0.2])
    xline(forest.EstimationPeriod/numObsPerChunk,"r-.")
    xline((forest.EstimationPeriod+forest.ScoreWarmupPeriod)/numObsPerChunk,"r")

    [Figure: three tiled plots against Iteration: Mean Score, Score Threshold, and Anomalies (shown with markers only). Each tile includes two vertical red lines, one at the end of the estimation period and one at the end of the warm-up period.]

    During the estimation period, fit estimates means and standard deviations using the observations, and does not fit the model or update the score threshold. During the warm-up period, fit fits the model and updates the score threshold, but returns all scores as NaN and all anomaly values as false. After the warm-up period, fit returns the observation scores and the indices of observations with scores above the score threshold value. A small score value indicates a normal observation, and a large score value indicates an anomaly.

    totalAnomalies = sum(numAnom)
    totalAnomalies = 
    3
    
    anomfrac = totalAnomalies/(n-forest.EstimationPeriod-forest.ScoreWarmupPeriod)
    anomfrac = 
    0.0012
    

    The software detects 3 anomalies after the warm-up and estimation periods. The contamination fraction after the estimation and warm-up periods is approximately 0.001.

    More About

    Algorithms

    References

    [1] Guha, Sudipto, N. Mishra, G. Roy, and O. Schrijvers. "Robust Random Cut Forest Based Anomaly Detection on Streams," Proceedings of The 33rd International Conference on Machine Learning 48 (June 2016): 2712–21.

    [2] Bartos, Matthew D., A. Mullapudi, and S. C. Troutman. "rrcf: Implementation of the Robust Random Cut Forest Algorithm for Anomaly Detection on Streams." Journal of Open Source Software 4, no. 35 (2019): 1336.

    Extended Capabilities

    Version History

    Introduced in R2023b