Ensemble Algorithms
This topic describes the ensemble learning algorithms supported by Statistics and Machine Learning Toolbox™, including bagging, random subspace, and various boosting algorithms. You can specify the algorithm by using the 'Method' name-value pair argument of fitcensemble, fitrensemble, or templateEnsemble. Use fitcensemble or fitrensemble to create an ensemble of learners for classification or regression, respectively. Use templateEnsemble to create an ensemble learner template, and pass the template to fitcecoc to specify ensemble binary learners for ECOC multiclass learning.
For bootstrap aggregation (bagging) and random forest, you can also use TreeBagger.
To learn how to choose an appropriate algorithm, see Choose an Applicable Ensemble Aggregation Method.
Note that some algorithms, such as LPBoost, TotalBoost, and RobustBoost, require Optimization Toolbox™.
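The following sketch illustrates these calling patterns. It assumes the fisheriris sample data set that ships with the product; the method names and number of learners are illustrative choices, not recommendations.

```matlab
% Specify the ensemble aggregation method through the 'Method' name-value argument.
load fisheriris                                        % sample data: predictors meas, labels species
bagMdl = fitcensemble(meas,species,'Method','Bag');    % bagged classification ensemble
t = templateEnsemble('AdaBoostM1',100,'Tree');         % ensemble template with 100 tree learners
ecocMdl = fitcecoc(meas,species,'Learners',t);         % ECOC model with ensemble binary learners
```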
Bootstrap Aggregation (Bagging) and Random Forest
Statistics and Machine Learning Toolbox offers three objects for bagging and random forest:
- ClassificationBaggedEnsemble object created by the fitcensemble function for classification
- RegressionBaggedEnsemble object created by the fitrensemble function for regression
- TreeBagger object created by the TreeBagger function for classification and regression
For details about the differences between TreeBagger and bagged ensembles (ClassificationBaggedEnsemble and RegressionBaggedEnsemble), see Comparison of TreeBagger and Bagged Ensembles.
Bootstrap aggregation (bagging) is a type of ensemble learning. To bag a weak learner such as a decision tree on a data set, generate many bootstrap replicas of the data set and grow decision trees on the replicas. Obtain each bootstrap replica by randomly selecting N out of N observations with replacement, where N is the data set size. In addition, every tree in the ensemble can randomly select predictors for each decision split, a technique called random forest [2] known to improve the accuracy of bagged trees. By default, the number of predictors to select at random for each split is equal to the square root of the number of predictors for classification, and one third of the number of predictors for regression. After training a model, you can find the predicted response of a trained ensemble for new data by using the predict function. predict takes an average over predictions from individual trees.
By default, the minimum number of observations per leaf for bagged trees is set to 1 for classification and 5 for regression. Trees grown with the default leaf size are usually very deep. These settings are close to optimal for the predictive power of an ensemble. Often you can grow trees with larger leaves without losing predictive power. Doing so reduces training and prediction time, as well as memory usage for the trained ensemble. You can control the minimum number of observations per leaf by using the 'MinLeafSize' name-value pair argument of templateTree or TreeBagger. Note that you use the templateTree function to specify the options of tree learners when you create a bagged ensemble by using fitcensemble or fitrensemble.
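As a rough sketch of these options, the following code trains a bagged regression ensemble with a larger leaf size and predicts a response for new data. It assumes the carsmall sample data set; the leaf size of 20 is an arbitrary illustrative value.

```matlab
load carsmall                                                  % sample data with Horsepower, Weight, MPG
X = [Horsepower Weight];
Y = MPG;
t = templateTree('MinLeafSize',20,'Reproducible',true);        % tree learner options for bagging
ens = fitrensemble(X,Y,'Method','Bag','NumLearningCycles',50,'Learners',t);
yfit = predict(ens,[150 3000])                                 % average of the individual tree predictions
```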
Several features of bagged decision trees make them a unique algorithm. Drawing N out of N observations with replacement omits 37% of observations, on average, for each decision tree. These omitted observations are called “out-of-bag” observations. TreeBagger and bagged ensembles (ClassificationBaggedEnsemble and RegressionBaggedEnsemble) have properties and object functions, whose names start with oob, that use out-of-bag observations.
- Use the oobPredict function to estimate predictive power and feature importance. For each observation, oobPredict estimates the out-of-bag prediction by averaging predictions from all trees in the ensemble for which the observation is out of bag.
- Estimate the average out-of-bag error by using oobError (for TreeBagger) or oobLoss (for bagged ensembles). These functions compare the out-of-bag predicted responses against the observed responses for all observations used for training. The out-of-bag average is an unbiased estimator of the true ensemble error.
- Obtain out-of-bag estimates of feature importance by using the OOBPermutedPredictorDeltaError property (for TreeBagger) or the oobPermutedPredictorImportance function (for bagged ensembles). The software randomly permutes out-of-bag data across one variable or column at a time and estimates the increase in the out-of-bag error due to this permutation. The larger the increase, the more important the feature. Therefore, you do not need to supply test data for bagged ensembles, because you can obtain reliable estimates of predictive power and feature importance in the process of training (see the sketch after this list).
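A minimal sketch of these out-of-bag estimates for a bagged classification ensemble, assuming the ionosphere sample data set:

```matlab
load ionosphere                                        % predictors X, class labels Y
rng(1)                                                 % for reproducibility of the bootstrap samples
ens = fitcensemble(X,Y,'Method','Bag');                % bagged classification trees
oobErr = oobLoss(ens)                                  % out-of-bag misclassification rate
imp = oobPermutedPredictorImportance(ens);             % permutation-based importance per predictor
[~,mostImportant] = max(imp)                           % column index of the most important predictor
```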
TreeBagger also offers the proximity matrix in the Proximity property. Every time two observations land on the same leaf of a tree, their proximity increases by 1. For normalization, sum these proximities over all trees in the ensemble and divide by the number of trees. The resulting matrix is symmetric with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use this matrix to find outlier observations and discover clusters in the data through multidimensional scaling.
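The following sketch computes the proximity matrix and outlier measures with TreeBagger, assuming the fisheriris sample data set and the fillprox method for populating the proximity-related properties:

```matlab
load fisheriris
rng(1)
b = TreeBagger(100,meas,species,'OOBPrediction','on');  % 100 bagged classification trees
b = fillprox(b);                                        % populate the Proximity and OutlierMeasure properties
prox = b.Proximity;                                     % N-by-N symmetric matrix, diagonal equal to 1
out = b.OutlierMeasure;                                 % larger values indicate likely outliers
```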
For examples using bagging, see the related examples in the Statistics and Machine Learning Toolbox documentation.
Comparison of TreeBagger and Bagged Ensembles
TreeBagger and bagged ensembles (ClassificationBaggedEnsemble and RegressionBaggedEnsemble) share most functionality, but not all. Additionally, some features have different names.
TreeBagger features not in bagged ensembles

| Feature | TreeBagger Property | TreeBagger Method |
|---|---|---|
| Computation of proximity matrix | Proximity | fillprox |
| Computation of outliers | OutlierMeasure | N/A |
| Out-of-bag estimates of predictor importance using classification margins | OOBPermutedPredictorDeltaMeanMargin and OOBPermutedPredictorCountRaiseMargin | N/A |
| Merging two ensembles trained separately | N/A | append |
| Quantile regression | N/A | quantilePredict, quantileError, oobQuantilePredict, oobQuantileError |
| Tall array support for creating ensemble | N/A | For details, see Tall Arrays. |
Bagged ensemble features not in TreeBagger

| Feature | Description |
|---|---|
| Hyperparameter optimization | Use the 'OptimizeHyperparameters' name-value pair argument. |
| Binning numeric predictors to speed up training | Use the 'NumBins' name-value pair argument. |
| Code generation for predict | After training a model, you can generate C/C++ code that predicts labels for new data. Generating C/C++ code requires MATLAB Coder™. For details, see Introduction to Code Generation. |
Different names for TreeBagger and bagged ensembles

| Feature | TreeBagger | Bagged Ensembles |
|---|---|---|
| Split criterion contributions for each predictor | DeltaCriterionDecisionSplit property | First output of predictorImportance (classification) or predictorImportance (regression) |
| Predictor associations | SurrogateAssociation property | Second output of predictorImportance (classification) or predictorImportance (regression) |
| Out-of-bag estimates of predictor importance | OOBPermutedPredictorDeltaError property | Output of oobPermutedPredictorImportance (classification) or oobPermutedPredictorImportance (regression) |
| Error (misclassification probability or mean-squared error) | error and oobError methods | loss and oobLoss methods (classification) or loss and oobLoss methods (regression) |
| Training additional trees and adding them to ensemble | growTrees method | resume method (classification) or resume method (regression) |
| Mean classification margin per tree | meanMargin and oobMeanMargin methods | edge and oobEdge methods (classification) |
In addition, two important differences exist when you train a model and predict responses:
- If you pass a misclassification cost matrix to TreeBagger, it passes the matrix along to the trees. If you pass a misclassification cost matrix to fitcensemble, it uses the matrix to adjust the class prior probabilities. fitcensemble then passes the adjusted prior probabilities and the default cost matrix to the trees. The default cost matrix is ones(K) - eye(K) for K classes.
- Unlike the loss and edge methods in ClassificationBaggedEnsemble, the TreeBagger error and meanMargin methods do not normalize input observation weights of the prior probabilities in the respective class.
Random Subspace
Use random subspace ensembles (Subspace) to improve the accuracy of discriminant analysis (ClassificationDiscriminant) or k-nearest neighbor (ClassificationKNN) classifiers. Subspace ensembles also have the advantage of using less memory than ensembles with all predictors, and can handle missing values (NaNs).
The basic random subspace algorithm uses these parameters:
- m is the number of dimensions (variables) to sample in each learner. Set m using the NPredToSample name-value pair argument.
- d is the number of dimensions in the data, which is the number of columns (predictors) in the data matrix X.
- n is the number of learners in the ensemble. Set n using the NLearn input.
The basic random subspace algorithm performs the following steps:
1. Choose without replacement a random set of m predictors from the d possible values.
2. Train a weak learner using just the m chosen predictors.
3. Repeat steps 1 and 2 until there are n weak learners.
4. Predict by taking an average of the score prediction of the weak learners, and classify into the category with the highest average score.
You can choose to create a weak learner for every possible set of m predictors from the d dimensions. To do so, set n, the number of learners, to 'AllPredictorCombinations'. In this case, there are nchoosek(size(X,2),NPredToSample) weak learners in the ensemble.
fitcensemble downweights predictors after choosing them for a learner, so subsequent learners have a lower chance of using a predictor that was previously used. This weighting tends to make predictors more evenly distributed among learners than in uniform weighting.
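A minimal sketch of a random subspace ensemble of k-nearest neighbor learners, assuming the ionosphere sample data set (the subspace size of 5 is an illustrative choice):

```matlab
load ionosphere
rng(2)
ens = fitcensemble(X,Y,'Method','Subspace','NumLearningCycles',60, ...
    'Learners','knn','NPredToSample',5);               % each k-NN learner sees 5 random predictors
cvens = crossval(ens,'KFold',5);                       % 5-fold cross-validated ensemble
kfoldLoss(cvens)                                       % estimated misclassification rate
```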
For examples using Subspace, see Random Subspace Classification.
Boosting Algorithms
Adaptive Boosting for Binary Classification
Adaptive boosting named AdaBoostM1 is a popular boosting algorithm for binary classification. The algorithm trains learners sequentially. For every learner with index t, AdaBoostM1 computes the weighted classification error

$$\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)}\, I\bigl(y_n \ne h_t(x_n)\bigr),$$

where
- xn is a vector of predictor values for observation n.
- yn is the true class label.
- ht is the prediction of learner (hypothesis) with index t.
- $I(\cdot)$ is the indicator function.
- $d_n^{(t)}$ is the weight of observation n at step t.
AdaBoostM1 then increases weights for observations misclassified by learner t and reduces weights for observations correctly classified by learner t. The next learner t + 1 is then trained on the data with updated weights $d_n^{(t+1)}$.
After training finishes, AdaBoostM1 computes the prediction for new data using

$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x),$$

where $\alpha_t$ are the weights of the weak hypotheses in the ensemble.
Training by AdaBoostM1 can be viewed as stagewise minimization of the exponential loss

$$\sum_{n=1}^{N} w_n \exp\bigl(-y_n f(x_n)\bigr),$$

where
- yn ∊ {–1,+1} is the true class label.
- wn are observation weights normalized to add up to 1.
- f(xn) ∊ (–∞,+∞) is the predicted classification score.
The observation weights wn are the original observation weights you passed to fitcensemble.
The second output from the predict method of an AdaBoostM1 classification ensemble is an N-by-2 matrix of classification scores for the two classes and N observations. The second column is always equal to minus the first column. The predict method returns two scores to be consistent with multiclass models, even though the second column is redundant for two-class models.
Most often, AdaBoostM1 is used with decision stumps (the default) or shallow trees. If boosted stumps give poor performance, try setting the minimal parent node size to one quarter of the training data.
By default, the learning rate for boosting algorithms is 1. If you set the learning rate to a lower number, the ensemble learns at a slower rate, but can converge to a better solution. A learning rate of 0.1 is a popular choice. Learning at a rate less than 1 is often called “shrinkage”.
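A minimal sketch of AdaBoostM1 with decision stumps and shrinkage, assuming the ionosphere sample data set:

```matlab
load ionosphere
stump = templateTree('MaxNumSplits',1);                % decision stump
ens = fitcensemble(X,Y,'Method','AdaBoostM1','NumLearningCycles',200, ...
    'Learners',stump,'LearnRate',0.1);                 % learning rate below 1 ("shrinkage")
[labels,scores] = predict(ens,X(1:5,:));               % scores is 5-by-2; column 2 equals -column 1
```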
For examples using AdaBoostM1, see Conduct Cost-Sensitive Comparison of Two Classification Models and Create an Ensemble Template for ECOC Multiclass Learning.
Adaptive Boosting for Multiclass Classification
Adaptive boosting named AdaBoostM2 is an extension of AdaBoostM1 for multiple classes. Instead of the weighted classification error, AdaBoostM2 uses the weighted pseudo-loss for N observations and K classes

$$\varepsilon_t = \frac{1}{2}\sum_{n=1}^{N}\sum_{k \ne y_n} d_{n,k}^{(t)}\bigl(1 - h_t(x_n,y_n) + h_t(x_n,k)\bigr),$$

where
- ht(xn,k) is the confidence of prediction by the learner at step t into class k, ranging from 0 (not at all confident) to 1 (highly confident).
- $d_{n,k}^{(t)}$ are observation weights at step t for class k.
- yn is the true class label taking one of the K values.
- The second sum is over all classes other than the true class yn.
Interpreting the pseudo-loss is harder than interpreting the classification error, but the idea is the same. Pseudo-loss can be used as a measure of the classification accuracy of any learner in an ensemble. Pseudo-loss typically exhibits the same behavior as the weighted classification error for AdaBoostM1: the first few learners in a boosted ensemble give low pseudo-loss values. After the first few training steps, the ensemble begins to learn at a slower pace, and the pseudo-loss value approaches 0.5 from below.
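A minimal sketch of AdaBoostM2 for a three-class problem, assuming the fisheriris sample data set:

```matlab
load fisheriris
ens = fitcensemble(meas,species,'Method','AdaBoostM2','NumLearningCycles',100);
cumErr = resubLoss(ens,'Mode','cumulative');           % resubstitution error vs. number of learners
plot(cumErr)
xlabel('Number of learners')
ylabel('Resubstitution error')
```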
For an example using AdaBoostM2, see Predict Class Labels Using Classification Ensemble.
Gentle Adaptive Boosting
Gentle adaptive boosting (GentleBoost, also known as Gentle AdaBoost) combines features of AdaBoostM1 and LogitBoost. Like AdaBoostM1, GentleBoost minimizes the exponential loss, but its numeric optimization is set up differently. Like LogitBoost, every weak learner fits a regression model to response values yn ∊ {–1,+1}.
fitcensemble computes and stores the mean-squared error in the FitInfo property of the ensemble object. The mean-squared error is

$$\sum_{n=1}^{N} d_n^{(t)}\bigl(y_n - h_t(x_n)\bigr)^2,$$

where
- $d_n^{(t)}$ are observation weights at step t (the weights add up to 1).
- ht(xn) are predictions of the regression model ht fitted to response values yn.
As the strength of individual learners weakens, the weighted mean-squared error approaches 1.
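To inspect this quantity, the following sketch plots the per-step weighted mean-squared error stored in FitInfo for a GentleBoost ensemble, assuming the ionosphere sample data set:

```matlab
load ionosphere
ens = fitcensemble(X,Y,'Method','GentleBoost','NumLearningCycles',100);
plot(ens.FitInfo)                                      % weighted MSE at each boosting step
xlabel('Learner index')
ylabel('Weighted mean-squared error')
```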
For examples using GentleBoost, see Speed Up Training ECOC Classifiers Using Binning and Parallel Computing and Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles.
Adaptive Logistic Regression
Adaptive logistic regression (LogitBoost) is another popular algorithm for binary classification. LogitBoost works similarly to AdaBoostM1, except it minimizes the binomial deviance

$$\sum_{n=1}^{N} w_n \log\bigl(1 + \exp(-2 y_n f(x_n))\bigr),$$

where
- yn ∊ {–1,+1} is the true class label.
- wn are observation weights normalized to add up to 1.
- f(xn) ∊ (–∞,+∞) is the predicted classification score.
Binomial deviance assigns less weight to badly misclassified observations (observations with large negative values of ynf(xn)). LogitBoost can give better average accuracy than AdaBoostM1 for data with poorly separable classes.
Learner t in a LogitBoost ensemble fits a regression model to response values

$$\tilde{y}_n = \frac{y_n^* - p_t(x_n)}{p_t(x_n)\bigl(1 - p_t(x_n)\bigr)},$$

where
- y*n ∊ {0,+1} are relabeled classes (0 instead of –1).
- pt(xn) is the current ensemble estimate of the probability for observation xn to be of class 1.
fitcensemble computes and stores the mean-squared error in the FitInfo property of the ensemble object. The mean-squared error is

$$\sum_{n=1}^{N} d_n^{(t)}\bigl(\tilde{y}_n - h_t(x_n)\bigr)^2,$$

where
- $d_n^{(t)}$ are observation weights at step t (the weights add up to 1).
- ht(xn) are predictions of the regression model ht fitted to response values $\tilde{y}_n$.
Values $\tilde{y}_n$ can range from –∞ to +∞, so the mean-squared error does not have well-defined bounds.
For examples using LogitBoost, see Train Classification Ensemble and Speed Up Training by Binning Numeric Predictor Values.
Linear Programming Boosting
Linear programming boosting (LPBoost), like TotalBoost, performs multiclass classification by attempting to maximize the minimal margin in the training set. This attempt uses optimization algorithms, namely linear programming for LPBoost. So you need an Optimization Toolbox license to use LPBoost or TotalBoost.
The margin of a classification is the difference between the predicted soft classification score for the true class, and the largest score for the false classes. For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node. For more information, see More About in margin.
Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [10] establish this inequality on the probability of obtaining a negative margin:

$$P_{\mathrm{test}}\bigl(m \le 0\bigr) \le P_{\mathrm{train}}\bigl(m \le \theta\bigr) + O\!\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V\,\log^2(N/V)}{\theta^2} + \log\frac{1}{\delta}}\right).$$

Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: to obtain a low generalization error, minimize the number of observations below margin θ in the training set.
LPBoost iteratively maximizes the minimal margin through a sequence of linear programming problems. Equivalently, by duality, LPBoost minimizes the maximal edge, where edge is the weighted mean margin (see More About). At each iteration, there are more constraints in the problem. So, for large problems, the optimization becomes increasingly constrained and slow to solve.
LPBoost typically creates ensembles with many learners whose weights are orders of magnitude smaller than those of other learners. To better enable you to remove these unimportant ensemble members, the compact method reorders the members of an LPBoost ensemble from largest weight to smallest. You can then easily remove the least important members of the ensemble using the removeLearners method.
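A minimal sketch of pruning an LPBoost ensemble down to its most heavily weighted learners, assuming the ionosphere sample data set and an Optimization Toolbox license (keeping 10 learners is an arbitrary illustrative choice):

```matlab
load ionosphere
ens = fitcensemble(X,Y,'Method','LPBoost','NumLearningCycles',100);
cens = compact(ens);                                   % learners reordered by descending weight
cens = removeLearners(cens,11:cens.NumTrained);        % keep only the 10 largest-weight learners
loss(cens,X,Y)                                         % resubstitution error of the pruned ensemble
```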
For an example using LPBoost, see LPBoost and TotalBoost for Small Ensembles.
Least-Squares Boosting
Least-squares boosting (LSBoost) fits regression ensembles. At every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously. The ensemble fits to minimize mean-squared error.
You can use LSBoost with shrinkage by passing in the LearnRate parameter. By default, this parameter is set to 1, and the ensemble learns at the maximal speed. If you set LearnRate to a value from 0 to 1, the ensemble fits every new learner to yn – ηf(xn), where
- yn is the observed response.
- f(xn) is the aggregated prediction from all weak learners grown so far for observation xn.
- η is the learning rate (see the sketch following this list).
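A minimal sketch of an LSBoost regression ensemble with shrinkage, assuming the carsmall sample data set:

```matlab
load carsmall
X = [Horsepower Weight];
Y = MPG;
ens = fitrensemble(X,Y,'Method','LSBoost','NumLearningCycles',300,'LearnRate',0.1);
resubLoss(ens)                                         % training-set mean-squared error
```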
For examples using LSBoost, see Train Regression Ensemble, Optimize a Boosted Regression Ensemble, and Ensemble Regularization.
Robust Boosting
Boosting algorithms such as AdaBoostM1 and LogitBoost increase weights for misclassified observations at every boosting step. These weights can become very large. If this happens, the boosting algorithm sometimes concentrates on a few misclassified observations and neglects the majority of training data. Consequently, the average classification accuracy suffers. In this situation, you can try using robust boosting (RobustBoost). This algorithm does not assign almost the entire data weight to badly misclassified observations. It can produce better average classification accuracy. You need an Optimization Toolbox license to use RobustBoost.
Unlike AdaBoostM1 and LogitBoost, RobustBoost does not minimize a specific loss function. Instead, it maximizes the number of observations with a classification margin above a certain threshold.
RobustBoost trains based on time evolution. The algorithm starts at t = 0. At every step, RobustBoost solves an optimization problem to find a positive step in time Δt and a corresponding positive change in the average margin for training data Δm. RobustBoost stops training and exits if at least one of these three conditions is true:
- Time t reaches 1.
- RobustBoost cannot find a solution to the optimization problem with positive updates Δt and Δm.
- RobustBoost grows as many learners as you requested.
Results from RobustBoost can be usable for any termination condition. Estimate the classification accuracy by cross-validation or by using an independent test set.
To get better classification accuracy from RobustBoost, you can adjust three parameters in fitcensemble: RobustErrorGoal, RobustMaxMargin, and RobustMarginSigma. Start by varying values for RobustErrorGoal from 0 to 1. The maximal allowed value for RobustErrorGoal depends on the two other parameters. If you pass a value that is too high, fitcensemble produces an error message showing the allowed range for RobustErrorGoal.
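A minimal sketch of RobustBoost with a hand-picked error goal, assuming the ionosphere sample data set and an Optimization Toolbox license (the parameter values are illustrative starting points, not tuned values):

```matlab
load ionosphere
ens = fitcensemble(X,Y,'Method','RobustBoost','NumLearningCycles',300, ...
    'RobustErrorGoal',0.15,'RobustMaxMargin',1);
resubLoss(ens)                                         % resubstitution misclassification rate
```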
For an example using RobustBoost, see Tune RobustBoost.
Random Undersampling Boosting
Random undersampling boosting (RUSBoost) is especially effective at classifying imbalanced data, meaning some class in the training data has many fewer members than another. RUS stands for Random Under Sampling. The algorithm takes N, the number of members in the class with the fewest members in the training data, as the basic unit for sampling. Classes with more members are undersampled by taking only N observations from each class. In other words, if there are K classes, then, for each weak learner in the ensemble, RUSBoost takes a subset of the data with N observations from each of the K classes. The boosting procedure follows the procedure in Adaptive Boosting for Multiclass Classification for reweighting and constructing the ensemble.
When you construct a RUSBoost ensemble, you can use the optional name-value pair argument RatioToSmallest. Give a vector of K values, each value representing the multiple of N to sample for the associated class. For example, if the smallest class has N = 100 members, then RatioToSmallest = [2,3,4] means each weak learner has 200 members in class 1, 300 in class 2, and 400 in class 3. If RatioToSmallest leads to a value that is larger than the number of members in a particular class, then RUSBoost samples the members with replacement. Otherwise, RUSBoost samples the members without replacement.
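A minimal sketch of RUSBoost on a synthetic imbalanced data set; the data, class names, and RatioToSmallest values here are hypothetical and chosen only for illustration:

```matlab
rng(3)
X = randn(1000,4);                                      % synthetic predictors
Y = [repmat({'rare'},100,1); repmat({'common'},900,1)]; % 100 vs. 900 observations
t = templateTree('MaxNumSplits',5);
ens = fitcensemble(X,Y,'Method','RUSBoost','NumLearningCycles',100, ...
    'Learners',t,'RatioToSmallest',[1 2]);              % multiples of N per class, in ClassNames order
resubLoss(ens)
```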
For an example using RUSBoost, see Classification with Imbalanced Data.
Totally Corrective Boosting
Totally corrective boosting (TotalBoost), like linear programming boosting (LPBoost), performs multiclass classification by attempting to maximize the minimal margin in the training set. This attempt uses optimization algorithms, namely quadratic programming for TotalBoost. So you need an Optimization Toolbox license to use LPBoost or TotalBoost.
The margin of a classification is the difference between the predicted soft classification score for the true class, and the largest score for the false classes. For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node. For more information, see More About in margin.
Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [10] establish this inequality on the probability of obtaining a negative margin:

$$P_{\mathrm{test}}\bigl(m \le 0\bigr) \le P_{\mathrm{train}}\bigl(m \le \theta\bigr) + O\!\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V\,\log^2(N/V)}{\theta^2} + \log\frac{1}{\delta}}\right).$$

Here m is the margin, θ is any positive number, V is the Vapnik-Chervonenkis dimension of the classifier space, N is the size of the training set, and δ is a small positive number. The inequality holds with probability 1–δ over many i.i.d. training and test sets. This inequality says: to obtain a low generalization error, minimize the number of observations below margin θ in the training set.
TotalBoost minimizes a proxy of the Kullback-Leibler divergence between the current weight distribution and the initial weight distribution, subject to the constraint that the edge (the weighted margin) is below a certain value. The proxy is a quadratic expansion of the divergence:

$$D(W, W_0) = \sum_{n=1}^{N} \log\frac{W(n)}{W_0(n)} \approx \sum_{n=1}^{N} \Delta + \frac{\Delta^2}{2\,W(n)},$$

where Δ is the difference between W(n), the weights at the current and next iteration, and W0, the initial weight distribution, which is uniform. This optimization formulation keeps weights from becoming zero. At each iteration, there are more constraints in the problem. So, for large problems, the optimization becomes increasingly constrained and slow to solve.
TotalBoost typically creates ensembles with many learners whose weights are orders of magnitude smaller than those of other learners. To better enable you to remove these unimportant ensemble members, the compact method reorders the members of a TotalBoost ensemble from largest weight to smallest. You can then easily remove the least important members of the ensemble using the removeLearners method.
For an example using TotalBoost, see LPBoost and TotalBoost for Small Ensembles.
References
[1] Breiman, L. "Bagging Predictors." Machine Learning 26, 1996, pp. 123–140.
[2] Breiman, L. "Random Forests." Machine Learning 45, 2001, pp. 5–32.
[3] Breiman, L. https://www.stat.berkeley.edu/~breiman/RandomForests/
[4] Freund, Y. "A more robust boosting algorithm." arXiv:0905.2138v1, 2009.
[5] Freund, Y. and R. E. Schapire. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." J. of Computer and System Sciences, Vol. 55, 1997, pp. 119–139.
[6] Friedman, J. "Greedy function approximation: A gradient boosting machine." Annals of Statistics, Vol. 29, No. 5, 2001, pp. 1189–1232.
[7] Friedman, J., T. Hastie, and R. Tibshirani. "Additive logistic regression: A statistical view of boosting." Annals of Statistics, Vol. 28, No. 2, 2000, pp. 337–407.
[8] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, second edition. New York: Springer, 2008.
[9] Ho, T. K. "The random subspace method for constructing decision forests." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, 1998, pp. 832–844.
[10] Schapire, R., and Y. Singer. "Improved boosting algorithms using confidence-rated predictions." Machine Learning, Vol. 37, No. 3, 1999, pp. 297–336.
[11] Schapire, R. E. et al. "Boosting the margin: A new explanation for the effectiveness of voting methods." Annals of Statistics, Vol. 26, No. 5, 1998, pp. 1651–1686.
[12] Seiffert, C., T. Khoshgoftaar, J. Hulse, and A. Napolitano. "RUSBoost: Improving classification performance when training data is skewed." 19th International Conference on Pattern Recognition, 2008, pp. 1–4.
[13] Warmuth, M., J. Liao, and G. Ratsch. "Totally corrective boosting algorithms that maximize the margin." Proc. 23rd Int’l. Conf. on Machine Learning, ACM, New York, 2006, pp. 1001–1008.
See Also
fitcensemble | fitrensemble | TreeBagger | ClassificationBaggedEnsemble | RegressionBaggedEnsemble | CompactClassificationEnsemble | CompactRegressionEnsemble | ClassificationKNN | ClassificationDiscriminant | RegressionEnsemble | ClassificationEnsemble | ClassificationPartitionedEnsemble | RegressionPartitionedEnsemble