This topic provides descriptions of ensemble learning algorithms supported by Statistics and Machine Learning Toolbox™, including bagging, random space, and various boosting algorithms. You can
specify the algorithm by using the `'Method'`

name-value pair argument of
`fitcensemble`

, `fitrensemble`

, or `templateEnsemble`

. Use `fitcensemble`

or `fitrensemble`

to create an ensemble of learners for classification or regression,
respectively. Use `templateEnsemble`

to create an ensemble learner
template, and pass the template to `fitcecoc`

to specify ensemble binary learners for
ECOC multiclass learning.

For bootstrap aggregation (bagging) and random forest, you can use `TreeBagger`

as well.

To learn about how to choose an appropriate algorithm, see Choose an Applicable Ensemble Aggregation Method.

Note that usage of some algorithms, such as `LPBoost`

,
`TotalBoost`

, and `RobustBoost`

, requires Optimization Toolbox™.

Statistics and Machine Learning Toolbox offers three objects for bagging and random forest:

`ClassificationBaggedEnsemble`

created by`fitcensemble`

for classification`RegressionBaggedEnsemble`

created by`fitrensemble`

for regression`TreeBagger`

created by`TreeBagger`

for classification and regression

For details about the differences between `TreeBagger`

and
bagged ensembles (`ClassificationBaggedEnsemble`

and
`RegressionBaggedEnsemble`

), see Comparison of TreeBagger and Bagged Ensembles.

*Bootstrap aggregation* (*bagging*) is a type
of ensemble learning. To bag a weak learner such as a decision tree on a data set, generate
many bootstrap replicas of the data set and grow decision trees on the replicas. Obtain each
bootstrap replica by randomly selecting `N`

out of `N`

observations with replacement, where `N`

is the data set size. In addition,
every tree in the ensemble can randomly select predictors for each decision split, a
technique called random forest [2] known to improve the
accuracy of bagged trees. By default, the number of predictors to select at random for each
split is equal to the square root of the number of predictors for classification, and one
third of the number of predictors for regression. After training a model, you can find the
predicted response of a trained ensemble for new data by using the
`predict`

function. `predict`

takes an average over
predictions from individual trees.

By default, the minimum number of observations per leaf for bagged trees is set to
`1`

for classification and `5`

for regression. Trees
grown with the default leaf size are usually very deep. These settings are close to optimal
for the predictive power of an ensemble. Often you can grow trees with larger leaves without
losing predictive power. Doing so reduces training and prediction time, as well as memory
usage for the trained ensemble. You can control the minimum number of observations per leaf
by using the `'MinLeafSize'`

name-value pair argument of
`templateTree`

or `TreeBagger`

. Note that you use
the `templateTree`

function to specify the options of tree learners when
you create a bagged ensemble by using `fitcensemble`

or `fitrensemble`

.

Several features of bagged decision trees make them a unique algorithm. Drawing
`N`

out of `N`

observations with replacement omits of
observations, on average 37%, for each decision tree. These omitted observations are called
“out-of-bag” observations. `TreeBagger`

and bagged ensembles
(`ClassificationBaggedEnsemble`

and
`RegressionBaggedEnsemble`

) have properties and object functions, whose
names start with `oob`

, that use out-of-bag observations.

Use the

`oobPredict`

function to estimate predictive power and feature importance. For each observation,`oobPredict`

estimates the out-of-bag prediction by averaging predictions from all trees in the ensemble for which the observation is out of bag.Estimate the average out-of-bag error by using

`oobError`

(for`TreeBagger`

) or`oobLoss`

(for bagged ensembles). These functions compare the out-of-bag predicted responses against the observed responses for all observations used for training. The out-of-bag average is an unbiased estimator of the true ensemble error.Obtain out-of-bag estimates of feature importance by using the

`OOBPermutedPredictorDeltaError`

property (for`TreeBagger`

) or`oobPermutedPredictorImportance`

property (for bagged ensembles). The software randomly permutes out-of-bag data across one variable or column at a time and estimates the increase in the out-of-bag error due to this permutation. The larger the increase, the more important the feature. Therefore, you do not need to supply test data for bagged ensembles because you can obtain reliable estimates of predictive power and feature importance in the process of training.

`TreeBagger`

also offers the proximity matrix in the
`Proximity`

property. Every time two observations land on the same leaf
of a tree, their proximity increases by 1. For normalization, sum these proximities over all
trees in the ensemble and divide by the number of trees. The resulting matrix is symmetric
with diagonal elements equal to 1 and off-diagonal elements ranging from 0 to 1. You can use
this matrix to find outlier observations and discover clusters in the data through
multidimensional scaling.

For examples using bagging, see:

`TreeBagger`

and Bagged Ensembles`TreeBagger`

and bagged ensembles (`ClassificationBaggedEnsemble`

and `RegressionBaggedEnsemble`

) share most functionalities, but not all.
Additionally, some functionalities have different names.

`TreeBagger`

features not in bagged ensembles

Feature | `TreeBagger` Property | `TreeBagger` Method |
---|---|---|

Computation of proximity matrix | `Proximity` | When you
estimate the proximity matrix and outliers of a |

Computation of outliers | `OutlierMeasure` | N/A |

Out-of-bag estimates of predictor importance using classification margins | `OOBPermutedPredictorDeltaMeanMargin` and
`OOBPermutedPredictorCountRaiseMargin`
| N/A |

Merging two ensembles trained separately | N/A | `append` |

Quantile regression | N/A | `quantilePredict` , `quantileError` , `oobQuantilePredict` , `oobQuantileError` |

Parallel computation for creating ensemble | N/A | Set the For dual-core systems and above, |

Tall array support for creating ensemble | N/A | For details, see Tall Arrays. |

For dual-core systems and above, `fitrensemble`

and `fitcensemble`

also parallelize training using
Intel Threading Building Blocks (TBB).

**Bagged ensemble features not in TreeBagger**

Feature | Description |
---|---|

Hyperparameter optimization | Use the `'OptimizeHyperparameters'` name-value pair
argument. |

Binning numeric predictors to speed up training | Use the `'NumBins'` name-value pair argument. |

Code generation for `predict` | After training a model, you can generate C/C++ code that predicts labels for new data. Generating C/C++ code requires MATLAB Coder™. For details, see Introduction to Code Generation. |

**Different names for TreeBagger and bagged ensembles**

Feature | TreeBagger | Bagged Ensembles |
---|---|---|

Split criterion contributions for each predictor | `DeltaCriterionDecisionSplit` property | First output of `predictorImportance` (classification)
or `predictorImportance`
(regression) |

Predictor associations | `SurrogateAssociation` property | Second output of `predictorImportance` (classification)
or `predictorImportance`
(regression) |

Out-of-bag estimates of predictor importance | `OOBPermutedPredictorDeltaError` property | Output of `oobPermutedPredictorImportance`
(classification) or `oobPermutedPredictorImportance`
(regression) |

Error (misclassification probability or mean-squared error) | `error` and `oobError` methods | `loss` and `oobLoss` methods (classification) or
`loss` and `oobLoss` methods (regression) |

Training additional trees and adding them to ensemble | `growTrees` method | `resume` method (classification) or
`resume` method (regression) |

Mean classification margin per tree | `meanMargin` and `oobMeanMargin` methods | `edge` and `oobEdge` methods
(classification) |

In addition, two important differences exist when you train a model and predict responses:

If you pass a misclassification cost matrix to

`TreeBagger`

, it passes the matrix along to the trees. If you pass a misclassification cost matrix to`fitcensemble`

, it uses the matrix to adjust the class prior probabilities.`fitcensemble`

then passes the adjusted prior probabilities and the default cost matrix to the trees. The default cost matrix is`ones(K)–eye(K)`

for`K`

classes.Unlike the

`loss`

and`edge`

methods in`ClassificationBaggedEnsemble`

, the`TreeBagger`

`error`

and`meanMargin`

methods do not normalize input observation weights of the prior probabilities in the respective class.

Use random subspace ensembles (`Subspace`

) to improve the accuracy of
discriminant analysis (`ClassificationDiscriminant`

) or
*k*-nearest neighbor (`ClassificationKNN`

) classifiers. `Subspace`

ensembles also have
the advantage of using less memory than ensembles with all predictors, and can handle
missing values (`NaN`

s).

The basic random subspace algorithm uses these parameters.

*m*is the number of dimensions (variables) to sample in each learner. Set*m*using the`NPredToSample`

name-value pair.*d*is the number of dimensions in the data, which is the number of columns (predictors) in the data matrix`X`

.*n*is the number of learners in the ensemble. Set*n*using the`NLearn`

input.

The basic random subspace algorithm performs the following steps:

Choose without replacement a random set of

*m*predictors from the*d*possible values.Train a weak learner using just the

*m*chosen predictors.Repeat steps 1 and 2 until there are

*n*weak learners.Predict by taking an average of the

`score`

prediction of the weak learners, and classify the category with the highest average`score`

.

You can choose to create a weak learner for every possible set of *m*
predictors from the *d* dimensions. To do so, set *n*, the
number of learners, to `'AllPredictorCombinations'`

. In this case, there
are `nchoosek(size(X,2),NPredToSample)`

weak learners in the
ensemble.

`fitcensemble`

downweights predictors after choosing them for a
learner, so subsequent learners have a lower chance of using a predictor that was previously
used. This weighting tends to make predictors more evenly distributed among learners than in
uniform weighting.

For examples using `Subspace`

, see Random Subspace Classification.

Adaptive boosting named `AdaBoostM1`

is a very popular boosting
algorithm for binary classification. The algorithm trains learners sequentially. For every
learner with index *t*, `AdaBoostM1`

computes the
weighted classification error

$${\epsilon}_{t}={\displaystyle \sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}\mathbb{I}\left({y}_{n}\ne {h}_{t}\left({x}_{n}\right)\right),$$

where

*x*is a vector of predictor values for observation_{n}*n*.*y*is the true class label._{n}*h*is the prediction of learner (hypothesis) with index_{t}*t*.$$\mathbb{I}$$ is the indicator function.

$${d}_{n}^{\left(t\right)}$$ is the weight of observation

*n*at step*t*.

`AdaBoostM1`

then increases weights for observations misclassified by
learner *t* and reduces weights for observations correctly classified by
learner *t*. The next learner *t* + 1 is then trained on the data with updated weights ${d}_{n}^{\left(t+1\right)}$.

After training finishes, `AdaBoostM1`

computes prediction for new
data using

$f\left(x\right)={\displaystyle \sum}_{t=1}^{T}{\alpha}_{t}{h}_{t}\left(x\right),$

where

${\alpha}_{t}=\frac{1}{2}\mathrm{log}\frac{1-{\epsilon}_{t}}{{\epsilon}_{t}}$

are weights of the weak hypotheses in the ensemble.

Training by `AdaBoostM1`

can be viewed as stagewise minimization of
the exponential loss

$\sum}_{n=1}^{N}{w}_{n}\mathrm{exp}\left(-{y}_{n}f\left({x}_{n}\right)\right),$

where

*y*∊ {–1,+1} is the true class label._{n}*w*are observation weights normalized to add up to 1._{n}*f*(*x*) ∊ (–∞,+∞) is the predicted classification score._{n}

The observation weights *w _{n}* are the original
observation weights you passed to

`fitcensemble`

.The second output from the `predict`

method of an
`AdaBoostM1`

classification ensemble is an *N*-by-2
matrix of classification scores for the two classes and *N* observations.
The second column in this matrix is always equal to minus the first column. The
`predict`

method returns two scores to be consistent with multiclass
models, though this is redundant because the second column is always the negative of the
first.

Most often `AdaBoostM1`

is used with decision stumps (default) or
shallow trees. If boosted stumps give poor performance, try setting the minimal parent
node size to one quarter of the training data.

By default, the learning rate for boosting algorithms is `1`

. If you
set the learning rate to a lower number, the ensemble learns at a slower rate, but can
converge to a better solution. `0.1`

is a popular choice for the learning
rate. Learning at a rate less than `1`

is often called
“shrinkage”.

For examples using `AdaBoostM1`

, see Conduct Cost-Sensitive Comparison of Two Classification Models and Create an Ensemble Template for ECOC Multiclass Learning.

Adaptive boosting named `AdaBoostM2`

is an extension of
`AdaBoostM1`

for multiple classes. Instead of weighted classification
error, `AdaBoostM2`

uses weighted pseudo-loss for *N*
observations and *K* classes

${\epsilon}_{t}=\frac{1}{2}{\displaystyle \sum}_{n=1}^{N}{\displaystyle \sum}_{k\ne {y}_{n}}{d}_{n,k}^{\left(t\right)}\left(1-{h}_{t}\left({x}_{n},{y}_{n}\right)+{h}_{t}\left({x}_{n},k\right)\right),$

where

*h*(_{t}*x*,_{n}*k*) is the confidence of prediction by learner at step*t*into class*k*ranging from 0 (not at all confident) to 1 (highly confident).${d}_{n,k}^{\left(t\right)}$ are observation weights at step

*t*for class*k*.*y*is the true class label taking one of the_{n}*K*values.The second sum is over all classes other than the true class

*y*._{n}

Interpreting the pseudo-loss is harder than classification error, but the idea is the
same. Pseudo-loss can be used as a measure of the classification accuracy from any learner
in an ensemble. Pseudo-loss typically exhibits the same behavior as a weighted
classification error for `AdaBoostM1`

: the first few learners in a
boosted ensemble give low pseudo-loss values. After the first few training steps, the
ensemble begins to learn at a slower pace, and the pseudo-loss value approaches 0.5 from
below.

For an example using `AdaBoostM2`

, see Predict Class Labels Using Classification Ensemble.

Gentle adaptive boosting (`GentleBoost`

, also known as Gentle
AdaBoost) combines features of `AdaBoostM1`

and
`LogitBoost`

. Like `AdaBoostM1`

,
`GentleBoost`

minimizes the exponential loss. But its numeric
optimization is set up differently. Like `LogitBoost`

, every weak learner
fits a regression model to response values *y _{n}* ∊ {–1,+1}.

`fitcensemble`

computes and stores the mean-squared error in the
`FitInfo`

property of the ensemble object. The mean-squared error
is

$\sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$

where

${d}_{n}^{\left(t\right)}$ are observation weights at step

*t*(the weights add up to 1).*h*(_{t}*x*) are predictions of the regression model_{n}*h*fitted to response values_{t}*y*._{n}

As the strength of individual learners weakens, the weighted mean-squared error approaches 1.

For examples using `GentleBoost`

, see Speed Up Training ECOC Classifiers Using Binning and Parallel Computing and Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles.

Adaptive logistic regression (`LogitBoost`

) is another popular
algorithm for binary classification. `LogitBoost`

works similarly to
`AdaBoostM1`

, except it minimizes binomial deviance

$\sum}_{n=1}^{N}{w}_{n}\mathrm{log}\left(1+\mathrm{exp}\left(-2{y}_{n}f\left({x}_{n}\right)\right)\right),$

where

*y*∊ {–1,+1} is the true class label._{n}*w*are observation weights normalized to add up to 1._{n}*f*(*x*) ∊ (–∞,+∞) is the predicted classification score._{n}

Binomial deviance assigns less weight to badly misclassified observations
(observations with large negative values of
*y _{n}f*(

`LogitBoost`

can give better average accuracy than
`AdaBoostM1`

for data with poorly separable classes.Learner *t* in a `LogitBoost`

ensemble fits a
regression model to response values

${\tilde{y}}_{n}=\frac{{y}_{n}^{*}-{p}_{t}\left({x}_{n}\right)}{{p}_{t}\left({x}_{n}\right)\left(1-{p}_{t}\left({x}_{n}\right)\right)},$

where

*y**∊ {0,+1} are relabeled classes (0 instead of –1)._{n}*p*(_{t}*x*) is the current ensemble estimate of the probability for observation_{n}*x*to be of class 1._{n}

`fitcensemble`

computes and stores the mean-squared error in the
`FitInfo`

property of the ensemble object. The mean-squared error
is

$\sum}_{n=1}^{N}{d}_{n}^{\left(t\right)}{\left({\tilde{y}}_{n}-{h}_{t}\left({x}_{n}\right)\right)}^{2},$

where

${d}_{n}^{\left(t\right)}$ are observation weights at step

*t*(the weights add up to 1).*h*(_{t}*x*) are predictions of the regression model_{n}*h*fitted to response values ${\tilde{y}}_{n}$._{t}

Values *y _{n}* can range from –∞ to
+∞, so the mean-squared error does not have well-defined bounds.

For examples using `LogitBoost`

, see Train Classification Ensemble and Speed Up Training by Binning Numeric Predictor Values.

Linear programming boosting (`LPBoost`

), like
`TotalBoost`

, performs multiclass classification by attempting to
maximize the minimal *margin* in the training set. This attempt uses
optimization algorithms, namely linear programming for `LPBoost`

. So you
need an Optimization Toolbox license to use `LPBoost`

or
`TotalBoost`

.

The margin of a classification is the difference between the predicted soft
classification *score* for the true class, and the largest score for
the false classes. For trees, the *score* of a classification of a
leaf node is the posterior probability of the classification at that node. The posterior
probability of the classification at a node is the number of training sequences that lead
to that node with the classification, divided by the number of training sequences that
lead to that node. For more information, see More About in `margin`

.

Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [10] establish this inequality on the probability of obtaining a negative margin:

$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$

Here *m* is the margin, *θ* is any positive number,
*V* is the Vapnik-Chervonenkis dimension of the classifier space,
*N* is the size of the training set, and *δ* is a
small positive number. The inequality holds with probability 1–*δ* over
many i.i.d. training and test sets. This inequality says: To obtain a low generalization
error, minimize the number of observations below margin *θ* in the
training set.

`LPBoost`

iteratively maximizes the minimal margin through a sequence
of linear programming problems. Equivalently, by duality, `LPBoost`

minimizes the maximal *edge*, where edge is the weighted mean margin
(see More About). At each iteration, there are more constraints in the problem. So,
for large problems, the optimization problem becomes increasingly constrained, and slow to
solve.

`LPBoost`

typically creates ensembles with many learners having
weights that are orders of magnitude smaller than those of other learners. Therefore, to
better enable you to remove the unimportant ensemble members, the `compact`

method reorders the members of an `LPBoost`

ensemble
from largest weight to smallest. Therefore, you can easily remove the least important
members of the ensemble using the `removeLearners`

method.

For an example using `LPBoost`

, see LPBoost and TotalBoost for Small Ensembles.

Least-squares boosting (`LSBoost`

) fits regression ensembles. At
every step, the ensemble fits a new learner to the difference between the observed
response and the aggregated prediction of all learners grown previously. The ensemble fits
to minimize mean-squared error.

You can use `LSBoost`

with shrinkage by passing in the
`LearnRate`

parameter. By default this parameter is set to
`1`

, and the ensemble learns at the maximal speed. If you set
`LearnRate`

to a value from `0`

to
`1`

, the ensemble fits every new learner to *y _{n}* –

*y*is the observed response._{n}*f*(*x*) is the aggregated prediction from all weak learners grown so far for observation_{n}*x*._{n}*η*is the learning rate.

For examples using `LSBoost`

, see Train Regression Ensemble, Optimize a Boosted Regression Ensemble, and Ensemble Regularization.

Boosting algorithms such as `AdaBoostM1`

and
`LogitBoost`

increase weights for misclassified observations at every
boosting step. These weights can become very large. If this happens, the boosting
algorithm sometimes concentrates on a few misclassified observations and neglects the
majority of training data. Consequently the average classification accuracy suffers. In
this situation, you can try using robust boosting (`RobustBoost`

). This
algorithm does not assign almost the entire data weight to badly misclassified
observations. It can produce better average classification accuracy. You need an
Optimization Toolbox license to use `RobustBoost`

.

Unlike `AdaBoostM1`

and `LogitBoost`

,
`RobustBoost`

does not minimize a specific loss function. Instead, it
maximizes the number of observations with the classification margin above a certain
threshold.

`RobustBoost`

trains based on time evolution. The algorithm starts at *t* = 0. At every step, `RobustBoost`

solves an optimization
problem to find a positive step in time Δ*t* and a corresponding positive
change in the average margin for training data Δ*m*.
`RobustBoost`

stops training and exits if at least one of these three
conditions is true:

Time

*t*reaches 1.`RobustBoost`

cannot find a solution to the optimization problem with positive updates Δ*t*and Δ*m*.`RobustBoost`

grows as many learners as you requested.

Results from `RobustBoost`

can be usable for any termination
condition. Estimate the classification accuracy by cross validation or by using an
independent test set.

To get better classification accuracy from `RobustBoost`

, you can
adjust three parameters in `fitcensemble`

:
`RobustErrorGoal`

, `RobustMaxMargin`

, and
`RobustMarginSigma`

. Start by varying values for
`RobustErrorGoal`

from 0 to 1. The maximal allowed value for
`RobustErrorGoal`

depends on the two other parameters. If you pass a
value that is too high, `fitcensemble`

produces an error message
showing the allowed range for `RobustErrorGoal`

.

For an example using `RobustBoost`

, see Tune RobustBoost.

Random undersampling boosting (`RUSBoost`

) is especially effective at
classifying imbalanced data, meaning some class in the training data has many fewer
members than another. RUS stands for Random Under Sampling. The algorithm takes
*N*, the number of members in the class with the fewest members in the
training data, as the basic unit for sampling. Classes with more members are under sampled
by taking only *N* observations of every class. In other words, if there
are *K* classes, then, for each weak learner in the ensemble,
`RUSBoost`

takes a subset of the data with *N*
observations from each of the *K* classes. The boosting procedure follows
the procedure in Adaptive Boosting for Multiclass Classification for reweighting and constructing the
ensemble.

When you construct a `RUSBoost`

ensemble, there is an optional
name-value pair called `RatioToSmallest`

. Give a vector of
*K* values, each value representing the multiple of
*N* to sample for the associated class. For example, if the smallest
class has *N* = 100 members, then `RatioToSmallest`

=
`[2,3,4]`

means each weak learner has 200 members in class 1, 300 in
class 2, and 400 in class 3. If `RatioToSmallest`

leads to a value that
is larger than the number of members in a particular class, then
`RUSBoost`

samples the members with replacement. Otherwise,
`RUSBoost`

samples the members without replacement.

For an example using `RUSBoost`

, see Classification with Imbalanced Data.

Totally corrective boosting (`TotalBoost`

), like linear programming
boost (`LPBoost`

), performs multiclass classification by attempting to
maximize the minimal *margin* in the training set. This attempt uses
optimization algorithms, namely quadratic programming for `TotalBoost`

.
So you need an Optimization Toolbox license to use `LPBoost`

or
`TotalBoost`

.

The margin of a classification is the difference between the predicted soft
classification *score* for the true class, and the largest score for
the false classes. For trees, the *score* of a classification of a
leaf node is the posterior probability of the classification at that node. The posterior
probability of the classification at a node is the number of training sequences that lead
to that node with the classification, divided by the number of training sequences that
lead to that node. For more information, see More About in `margin`

.

Why maximize the minimal margin? For one thing, the generalization error (the error on new data) is the probability of obtaining a negative margin. Schapire and Singer [10] establish this inequality on the probability of obtaining a negative margin:

$${P}_{\text{test}}\left(m\le 0\right)\le {P}_{\text{train}}\left(m\le \theta \right)+O\left(\frac{1}{\sqrt{N}}\sqrt{\frac{V{\mathrm{log}}^{2}(N/V)}{{\theta}^{2}}+\mathrm{log}(1/\delta )}\right).$$

Here *m* is the margin, *θ* is any positive number,
*V* is the Vapnik-Chervonenkis dimension of the classifier space,
*N* is the size of the training set, and *δ* is a
small positive number. The inequality holds with probability 1–*δ* over
many i.i.d. training and test sets. This inequality says: To obtain a low generalization
error, minimize the number of observations below margin *θ* in the
training set.

`TotalBoost`

minimizes a proxy of the Kullback-Leibler divergence
between the current weight distribution and the initial weight distribution, subject to
the constraint that the *edge* (the weighted margin) is below a
certain value. The proxy is a quadratic expansion of the divergence:

$$D(W,{W}_{0})={\displaystyle \sum _{n=1}^{N}\mathrm{log}\frac{W(n)}{{W}_{0}(n)}}\approx {\displaystyle \sum _{n=1}^{N}\left(1+\frac{W(n)}{{W}_{0}(n)}\right)\Delta +\frac{1}{2W(n)}{\Delta}^{2}},$$

where Δ is the difference between *W*(*n*), the
weights at the current and next iteration, and
*W*_{0}, the initial weight distribution, which is
uniform. This optimization formulation keeps weights from becoming zero. At each
iteration, there are more constraints in the problem. So, for large problems, the
optimization problem becomes increasingly constrained, and slow to solve.

`TotalBoost`

typically creates ensembles with many learners having
weights that are orders of magnitude smaller than those of other learners. Therefore, to
better enable you to remove the unimportant ensemble members, the `compact`

method reorders the members of a `TotalBoost`

ensemble from largest weight to smallest. Therefore you can easily remove the least
important members of the ensemble using the `removeLearners`

method.

For an example using `TotalBoost`

, see LPBoost and TotalBoost for Small Ensembles.

[1] Breiman, L. *Bagging
Predictors.* Machine Learning 26, 1996, pp. 123–140.

[2] Breiman, L. *Random
Forests.* Machine Learning 45, 2001, pp. 5–32.

[3] Breiman, L. `https://www.stat.berkeley.edu/~breiman/RandomForests/`

[4] Freund, Y. *A more
robust boosting algorithm.* arXiv:0905.2138v1, 2009.

[5] Freund, Y. and R. E. Schapire.
*A Decision-Theoretic Generalization of On-Line Learning and an Application to
Boosting.* J. of Computer and System Sciences, Vol. 55, 1997, pp.
119–139.

[6] Friedman, J. *Greedy
function approximation: A gradient boosting machine.* Annals of Statistics,
Vol. 29, No. 5, 2001, pp. 1189–1232.

[7] Friedman, J., T. Hastie, and R.
Tibshirani. *Additive logistic regression: A statistical view of
boosting.* Annals of Statistics, Vol. 28, No. 2, 2000, pp.
337–407.

[8] Hastie, T., R. Tibshirani, and
J. Friedman. *The Elements of Statistical Learning*, second edition.
New York: Springer, 2008.

[9] Ho, T. K. *The random
subspace method for constructing decision forests.* IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, 1998, pp. 832–844.

[10] Schapire, R., and Y. Singer.
*Improved boosting algorithms using confidence-rated predictions.*
Machine Learning, Vol. 37, No. 3, 1999, pp. 297–336.

[11] Schapire, R. E. et al.
*Boosting the margin: A new explanation for the effectiveness of voting
methods.* Annals of Statistics, Vol. 26, No. 5, 1998, pp.
1651–1686.

[12] Seiffert, C., T. Khoshgoftaar,
J. Hulse, and A. Napolitano. *RUSBoost: Improving clasification performance when
training data is skewed.* 19th International Conference on Pattern Recognition,
2008, pp. 1–4.

[13] Warmuth, M., J. Liao, and G.
Ratsch. *Totally corrective boosting algorithms that maximize the
margin.* Proc. 23rd Int’l. Conf. on Machine Learning, ACM, New York, 2006, pp.
1001–1008.

`ClassificationBaggedEnsemble`

| `ClassificationDiscriminant`

| `ClassificationEnsemble`

| `ClassificationKNN`

| `ClassificationPartitionedEnsemble`

| `CompactClassificationEnsemble`

| `CompactRegressionEnsemble`

| `fitcensemble`

| `fitrensemble`

| `RegressionBaggedEnsemble`

| `RegressionEnsemble`

| `RegressionPartitionedEnsemble`

| `TreeBagger`