# templateLinear

Linear classification learner template

## Syntax

``t = templateLinear()``
``t = templateLinear(Name,Value)``

## Description

`templateLinear` creates a template suitable for fitting a linear classification model to high-dimensional data for multiclass problems.

The template specifies the binary learner model, regularization type and strength, and solver, among other things. After creating the template, train the model by passing the template and data to `fitcecoc`.


`t = templateLinear()` returns a linear classification learner template. If you specify a default template, then the software uses default values for all input arguments during training.


`t = templateLinear(Name,Value)` returns a template with additional options specified by one or more name-value pair arguments. For example, you can specify to implement logistic regression, specify the regularization type or strength, or specify the solver to use for objective-function minimization. If you display `t` in the Command Window, then all options appear empty (`[]`) except those that you specify using name-value pair arguments. During training, the software uses default values for empty options.
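For instance, this minimal sketch creates a template for lasso-penalized logistic regression; the name-value choices here are illustrative, not defaults.

```matlab
% Create a logistic regression template with a lasso penalty.
% Unspecified options display as [] and take their defaults during training.
t = templateLinear('Learner','logistic','Regularization','lasso')
```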

## Examples


Train an ECOC model composed of multiple binary, linear classification models.

`load nlpdata`

`X` is a sparse matrix of predictor data, and `Y` is a categorical vector of class labels. There are more than two classes in the data.

Create a default linear-classification-model template.

`t = templateLinear();`

To adjust the default values, see the Name-Value Pair Arguments section on the `templateLinear` page.

Train an ECOC model composed of multiple binary, linear classification models that can identify the product given the frequency distribution of words on a documentation web page. For faster training time, transpose the predictor data, and specify that observations correspond to columns.

```matlab
X = X';
rng(1); % For reproducibility
Mdl = fitcecoc(X,Y,'Learners',t,'ObservationsIn','columns')
```
```
Mdl = 
  CompactClassificationECOC
      ResponseName: 'Y'
        ClassNames: [comm    dsp    ecoder    fixedpoint    ...]
    ScoreTransform: 'none'
    BinaryLearners: {78x1 cell}
      CodingMatrix: [13x78 double]

  Properties, Methods
```

Alternatively, you can train an ECOC model composed of default linear classification models using `'Learners','Linear'`.

To conserve memory, `fitcecoc` returns trained ECOC models composed of linear classification learners in `CompactClassificationECOC` model objects.
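As a sketch of that alternative, the following call (reusing `X`, `Y`, and the column orientation from above) trains the same kind of model without an explicit template:

```matlab
% Equivalent shortcut: default linear binary learners, no template object.
Mdl2 = fitcecoc(X,Y,'Learners','linear','ObservationsIn','columns');
```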

## Input Arguments


### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'Learner','logistic','Regularization','lasso','CrossVal','on'` specifies to implement logistic regression with a lasso penalty, and to implement 10-fold cross-validation.

#### Linear Classification Options

Regularization term strength, specified as the comma-separated pair consisting of `'Lambda'` and `'auto'`, a nonnegative scalar, or a vector of nonnegative values.

• For `'auto'`, `Lambda` = 1/n.

  • If you specify a cross-validation name-value pair argument (for example, `CrossVal`), then n is the number of in-fold observations.

  • Otherwise, n is the training sample size.

• For a vector of nonnegative values, `templateLinear` sequentially optimizes the objective function for each distinct value in `Lambda` in ascending order.

  • If `Solver` is `'sgd'` or `'asgd'` and `Regularization` is `'lasso'`, `templateLinear` does not use the previous coefficient estimates as a warm start for the next optimization iteration. Otherwise, `templateLinear` uses warm starts.

  • If `Regularization` is `'lasso'`, then any coefficient estimate of 0 retains its value when `templateLinear` optimizes using subsequent values in `Lambda`.

  • `templateLinear` returns coefficient estimates for each specified regularization strength.

Example: `'Lambda',10.^(-(10:-2:2))`

Data Types: `char` | `string` | `double` | `single`
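As an illustration of the `'auto'` rule, this sketch sets `Lambda` to 1/n explicitly; the sample size `n` here is hypothetical.

```matlab
% With 'Lambda','auto' and no cross-validation, the software uses Lambda = 1/n.
n = 31572;                         % hypothetical training sample size
t = templateLinear('Lambda',1/n);  % matches 'Lambda','auto' in this case
```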

Linear classification model type, specified as the comma-separated pair consisting of `'Learner'` and `'svm'` or `'logistic'`.

In this table, $f\left(x\right)=x\beta +b.$

• β is a vector of p coefficients.

• x is an observation from p predictor variables.

• b is the scalar bias.

| Value | Algorithm | Response Range | Loss Function |
| --- | --- | --- | --- |
| `'svm'` | Support vector machine | y ∊ {–1,1}; 1 for the positive class and –1 otherwise | Hinge: $\ell[y,f(x)] = \max[0, 1 - yf(x)]$ |
| `'logistic'` | Logistic regression | Same as `'svm'` | Deviance (logistic): $\ell[y,f(x)] = \log\{1 + \exp[-yf(x)]\}$ |

Example: `'Learner','logistic'`
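The two loss functions in the table differ only in how they penalize the margin m = yf(x); this illustrative snippet evaluates both on a grid of margins.

```matlab
% Hinge (SVM) and deviance (logistic) losses as functions of the margin y*f(x).
m = -2:0.5:2;
hingeLoss = max(0,1 - m);        % max[0, 1 - y*f(x)]
devLoss   = log(1 + exp(-m));    % log{1 + exp[-y*f(x)]}
```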

Complexity penalty type, specified as the comma-separated pair consisting of `'Regularization'` and `'lasso'` or `'ridge'`.

The software composes the objective function for minimization from the sum of the average loss function (see `Learner`) and the regularization term in this table.

| Value | Description |
| --- | --- |
| `'lasso'` | Lasso (L1) penalty: $\lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$ |
| `'ridge'` | Ridge (L2) penalty: $\frac{\lambda}{2} \sum_{j=1}^{p} \beta_j^2$ |

To specify the regularization term strength, which is λ in the expressions, use `Lambda`.

The software excludes the bias term (β0) from the regularization penalty.

If `Solver` is `'sparsa'`, then the default value of `Regularization` is `'lasso'`. Otherwise, the default is `'ridge'`.

Tip

• For predictor variable selection, specify `'lasso'`. For more on variable selection, see Introduction to Feature Selection.

• For optimization accuracy, specify `'ridge'`.

Example: `'Regularization','lasso'`
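For illustration, this sketch evaluates both penalty terms for a hypothetical coefficient vector and regularization strength.

```matlab
% Lasso (L1) and ridge (L2) penalties for hypothetical beta and lambda.
beta   = [0.5; -1.2; 0; 3];
lambda = 0.1;
lassoPenalty = lambda*sum(abs(beta));    % lambda * sum_j |beta_j|
ridgePenalty = (lambda/2)*sum(beta.^2);  % (lambda/2) * sum_j beta_j^2
```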

Objective function minimization technique, specified as the comma-separated pair consisting of `'Solver'` and a character vector or string scalar, a string array, or a cell array of character vectors with values from this table.

| Value | Description | Restrictions |
| --- | --- | --- |
| `'sgd'` | Stochastic gradient descent (SGD) [4][2] | |
| `'asgd'` | Average stochastic gradient descent (ASGD) [7] | |
| `'dual'` | Dual SGD for SVM [1][6] | `Regularization` must be `'ridge'` and `Learner` must be `'svm'`. |
| `'bfgs'` | Broyden-Fletcher-Goldfarb-Shanno quasi-Newton algorithm (BFGS) [3] | Inefficient if `X` is very high-dimensional. |
| `'lbfgs'` | Limited-memory BFGS (LBFGS) [3] | `Regularization` must be `'ridge'`. |
| `'sparsa'` | Sparse Reconstruction by Separable Approximation (SpaRSA) [5] | `Regularization` must be `'lasso'`. |

If you specify:

• A ridge penalty (see `Regularization`) and the predictor data set contains 100 or fewer predictor variables, then the default solver is `'bfgs'`.

• An SVM model (see `Learner`), a ridge penalty, and the predictor data set contains more than 100 predictor variables, then the default solver is `'dual'`.

• A lasso penalty and the predictor data set contains 100 or fewer predictor variables, then the default solver is `'sparsa'`.

Otherwise, the default solver is `'sgd'`. Note that the default solver can change when you perform hyperparameter optimization. For more information, see Regularization method determines the linear learner solver used during hyperparameter optimization.

If you specify a string array or cell array of solver names, then, for each value in `Lambda`, the software uses the solutions of solver j as a warm start for solver j + 1.

Example: `{'sgd' 'lbfgs'}` applies SGD to solve the objective, and uses the solution as a warm start for LBFGS.

Tip

• SGD and ASGD can solve the objective function more quickly than other solvers, whereas LBFGS and SpaRSA can yield more accurate solutions than other solvers. Solver combinations like `{'sgd' 'lbfgs'}` and `{'sgd' 'sparsa'}` can balance optimization speed and accuracy.

• When choosing between SGD and ASGD, consider that:

• SGD takes less time per iteration, but requires more iterations to converge.

• ASGD requires fewer iterations to converge, but takes more time per iteration.

• If the predictor data is high dimensional and `Regularization` is `'ridge'`, set `Solver` to any of these combinations:

• `'sgd'`

• `'asgd'`

• `'dual'` if `Learner` is `'svm'`

• `'lbfgs'`

• `{'sgd','lbfgs'}`

• `{'asgd','lbfgs'}`

• `{'dual','lbfgs'}` if `Learner` is `'svm'`

Although you can set other combinations, they often lead to solutions with poor accuracy.

• If the predictor data is low- through moderate-dimensional and `Regularization` is `'ridge'`, set `Solver` to `'bfgs'`.

• If `Regularization` is `'lasso'`, set `Solver` to any of these combinations:

• `'sgd'`

• `'asgd'`

• `'sparsa'`

• `{'sgd','sparsa'}`

• `{'asgd','sparsa'}`

Example: `'Solver',{'sgd','lbfgs'}`
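For example, this sketch chains SGD and LBFGS so that the SGD solution warm-starts LBFGS; the combination assumes a ridge penalty.

```matlab
% SGD for a fast approximate solution, then LBFGS to refine it.
t = templateLinear('Solver',{'sgd','lbfgs'},'Regularization','ridge');
```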

Initial linear coefficient estimates (β), specified as the comma-separated pair consisting of `'Beta'` and a p-dimensional numeric vector or a p-by-L numeric matrix. p is the number of predictor variables in `X` and L is the number of regularization-strength values (for more details, see `Lambda`).

• If you specify a p-dimensional vector, then the software optimizes the objective function L times using this process.

1. The software optimizes using `Beta` as the initial value and the minimum value of `Lambda` as the regularization strength.

2. The software optimizes again using the resulting estimate from the previous optimization as a warm start, and the next smallest value in `Lambda` as the regularization strength.

3. The software implements step 2 until it exhausts all values in `Lambda`.

• If you specify a p-by-L matrix, then the software optimizes the objective function L times. At iteration `j`, the software uses `Beta(:,j)` as the initial value and, after it sorts `Lambda` in ascending order, uses `Lambda(j)` as the regularization strength.

If you set `'Solver','dual'`, then the software ignores `Beta`.

Data Types: `single` | `double`
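A hypothetical sketch of the matrix form: supply one initial coefficient column per value in `Lambda` (here, p and the `Lambda` values are made up).

```matlab
% p predictors, L = 3 regularization strengths, one initial column per value.
p  = 100;
lambdaVals = [1e-4 1e-2 1];
B0 = zeros(p,numel(lambdaVals));          % p-by-L initial estimates
t  = templateLinear('Beta',B0,'Lambda',lambdaVals);
```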

Initial intercept estimate (b), specified as the comma-separated pair consisting of `'Bias'` and a numeric scalar or an L-dimensional numeric vector. L is the number of regularization-strength values (for more details, see `Lambda`).

• If you specify a scalar, then the software optimizes the objective function L times using this process.

1. The software optimizes using `Bias` as the initial value and the minimum value of `Lambda` as the regularization strength.

2. The software uses the resulting estimate as a warm start for the next optimization iteration, and uses the next smallest value in `Lambda` as the regularization strength.

3. The software implements step 2 until it exhausts all values in `Lambda`.

• If you specify an L-dimensional vector, then the software optimizes the objective function L times. At iteration `j`, the software uses `Bias(j)` as the initial value and, after it sorts `Lambda` in ascending order, uses `Lambda(j)` as the regularization strength.

• By default:

  • If `Learner` is `'logistic'`, then let gj be 1 if `Y(j)` is the positive class, and –1 otherwise. `Bias` is the weighted average of the g values for the training observations or, for cross-validation, the in-fold observations.

  • If `Learner` is `'svm'`, then `Bias` is 0.

Data Types: `single` | `double`
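As a sketch of the logistic default with uniform weights, the bias is the plain average of the g values; the names `Y` and `posClass` here are hypothetical.

```matlab
% Default logistic bias with unit observation weights (illustration only).
g = 2*double(Y == posClass) - 1;   % +1 for the positive class, -1 otherwise
bias0 = mean(g);                   % with weights, use a weighted average
```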

Linear model intercept inclusion flag, specified as the comma-separated pair consisting of `'FitBias'` and `true` or `false`.

| Value | Description |
| --- | --- |
| `true` | The software includes the bias term b in the linear model, and then estimates it. |
| `false` | The software sets b = 0 during estimation. |

Example: `'FitBias',false`

Data Types: `logical`

Flag to fit the linear model intercept after optimization, specified as the comma-separated pair consisting of `'PostFitBias'` and `true` or `false`.

| Value | Description |
| --- | --- |
| `false` | The software estimates the bias term b and the coefficients β during optimization. |
| `true` | To estimate b, the software: (1) estimates β and b using the model, (2) estimates classification scores, and then (3) refits b by placing the threshold on the classification scores that attains maximum accuracy. |

If you specify `true`, then `FitBias` must be `true`.

Example: `'PostFitBias',true`

Data Types: `logical`

Verbosity level, specified as the comma-separated pair consisting of `'Verbose'` and either `0` or `1`. `Verbose` controls the display of diagnostic information at the command line.

| Value | Description |
| --- | --- |
| `0` | `templateLinear` does not display diagnostic information. |
| `1` | `templateLinear` periodically displays the value of the objective function, gradient magnitude, and other diagnostic information. |

Example: `'Verbose',1`

Data Types: `single` | `double`

#### SGD and ASGD Solver Options

Mini-batch size, specified as the comma-separated pair consisting of `'BatchSize'` and a positive integer. At each iteration, the software estimates the gradient using `BatchSize` observations from the training data.

• If the predictor data is a numeric matrix, then the default value is `10`.

• If the predictor data is a sparse matrix, then the default value is `max([10,ceil(sqrt(ff))])`, where `ff = numel(X)/nnz(X)`, that is, the fullness factor of `X`.

Example: `'BatchSize',100`

Data Types: `single` | `double`
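This illustrative snippet computes the sparse-data default, assuming `X` is a sparse predictor matrix.

```matlab
% Default BatchSize for sparse X, based on the fullness factor.
ff = numel(X)/nnz(X);                          % fullness factor of X
defaultBatchSize = max([10,ceil(sqrt(ff))]);
```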

Learning rate, specified as the comma-separated pair consisting of `'LearnRate'` and a positive scalar. `LearnRate` controls the optimization step size by scaling the subgradient.

• If `Regularization` is `'ridge'`, then `LearnRate` specifies the initial learning rate γ0, and `templateLinear` determines the learning rate for iteration t, γt, using

  $\gamma_t = \frac{\gamma_0}{(1 + \lambda \gamma_0 t)^{c}},$

  where λ is the value of `Lambda`, c = 1 if `Solver` is `'sgd'`, and c = 0.75 if `Solver` is `'asgd'`.

• If `Regularization` is `'lasso'`, then `LearnRate` is constant for all iterations.

By default, `LearnRate` is `1/sqrt(1+max((sum(X.^2,obsDim))))`, where `obsDim` is `1` if the observations compose the columns of the predictor data `X`, and `2` otherwise.

Example: `'LearnRate',0.01`

Data Types: `single` | `double`
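This sketch evaluates the ridge learning-rate schedule over the first 1000 iterations; the values of γ0, λ, and c are hypothetical.

```matlab
% gamma_t = gamma0/(1 + lambda*gamma0*t)^c, here with c = 1 (SGD case).
gamma0 = 0.01; lambda = 1e-4; c = 1;
iter   = 1:1000;
gammaT = gamma0./(1 + lambda*gamma0*iter).^c;
```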

Flag to decrease the learning rate when the software detects divergence (that is, over-stepping the minimum), specified as the comma-separated pair consisting of `'OptimizeLearnRate'` and `true` or `false`.

If `OptimizeLearnRate` is `true`, then:

1. The software starts optimization using `LearnRate` as the learning rate.

2. If the value of the objective function increases, then the software restarts and uses half of the current value of the learning rate.

3. The software iterates step 2 until the objective function decreases.

Example: `'OptimizeLearnRate',true`

Data Types: `logical`

Number of mini-batches between lasso truncation runs, specified as the comma-separated pair consisting of `'TruncationPeriod'` and a positive integer.

After a truncation run, the software applies a soft threshold to the linear coefficients. That is, after processing k = `TruncationPeriod` mini-batches, the software truncates the estimated coefficient j using

$$\hat{\beta}_j^{\ast} = \begin{cases} \hat{\beta}_j - u_t & \text{if } \hat{\beta}_j > u_t, \\ 0 & \text{if } \lvert \hat{\beta}_j \rvert \le u_t, \\ \hat{\beta}_j + u_t & \text{if } \hat{\beta}_j < -u_t. \end{cases}$$

• For SGD, $\hat{\beta}_j$ is the estimate of coefficient j after processing k mini-batches, and $u_t = k \gamma_t \lambda$, where $\gamma_t$ is the learning rate at iteration t and λ is the value of `Lambda`.

• For ASGD, $\hat{\beta}_j$ is the averaged estimate of coefficient j after processing k mini-batches, and $u_t = k \lambda$.

If `Regularization` is `'ridge'`, then the software ignores `TruncationPeriod`.

Example: `'TruncationPeriod',100`

Data Types: `single` | `double`
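The truncation rule is a soft threshold on each coefficient, as this illustrative snippet shows; the estimates and threshold are made-up values.

```matlab
% Soft threshold: shrink each coefficient toward zero by u, zeroing small ones.
softThreshold = @(b,u) sign(b).*max(abs(b) - u,0);
bHat = [0.8 -0.05 0.3];
u = 0.1;                             % u_t = k*gamma_t*lambda (SGD) or k*lambda (ASGD)
bTruncated = softThreshold(bHat,u)   % returns [0.7 0 0.2]
```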

#### SGD and ASGD Convergence Controls

Maximal number of batches to process, specified as the comma-separated pair consisting of `'BatchLimit'` and a positive integer. When the software processes `BatchLimit` batches, it terminates optimization.

• By default:

  • The software passes through the data `PassLimit` times.

  • If you specify multiple solvers and use (A)SGD to get an initial approximation for the next solver, then the default value is `ceil(1e6/BatchSize)`, where `BatchSize` is the value of the `'BatchSize'` name-value pair argument.

• If you specify `BatchLimit`, then `templateLinear` uses the argument that results in processing the fewest observations, either `BatchLimit` or `PassLimit`.

Example: `'BatchLimit',100`

Data Types: `single` | `double`

Relative tolerance on the linear coefficients and the bias term (intercept), specified as the comma-separated pair consisting of `'BetaTolerance'` and a nonnegative scalar.

Let $B_t = [\beta_t' \; b_t]$, that is, the vector of the coefficients and the bias term at optimization iteration t. If $\left\| \frac{B_t - B_{t-1}}{B_t} \right\|_2 < \text{BetaTolerance}$, then optimization terminates.

If the software converges for the last solver specified in `Solver`, then optimization terminates. Otherwise, the software uses the next solver specified in `Solver`.

Example: `'BetaTolerance',1e-6`

Data Types: `single` | `double`
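As an illustration of this stopping test, with the ratio interpreted elementwise, consider two consecutive iterates (all values hypothetical):

```matlab
% Relative change in B_t = [beta_t' b_t] between consecutive iterations.
Bprev = [0.50 -1.20 0.10];
Bcur  = [0.51 -1.19 0.10];
relChange = norm((Bcur - Bprev)./Bcur);   % compared against BetaTolerance
stop = relChange < 1e-4;                  % false here; keep iterating
```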

Number of batches to process before next convergence check, specified as the comma-separated pair consisting of `'NumCheckConvergence'` and a positive integer.

To specify the batch size, see `BatchSize`.

The software checks for convergence about 10 times per pass through the entire data set by default.

Example: `'NumCheckConvergence',100`

Data Types: `single` | `double`

Maximal number of passes through the data, specified as the comma-separated pair consisting of `'PassLimit'` and a positive integer.

The software processes all observations when it completes one pass through the data.

When the software passes through the data `PassLimit` times, it terminates optimization.

If you specify `BatchLimit`, then `templateLinear` uses the argument that results in processing the fewest observations, either `BatchLimit` or `PassLimit`.

Example: `'PassLimit',5`

Data Types: `single` | `double`

#### Dual SGD Convergence Controls

Relative tolerance on the linear coefficients and the bias term (intercept), specified as the comma-separated pair consisting of `'BetaTolerance'` and a nonnegative scalar.

Let $B_t = [\beta_t' \; b_t]$, that is, the vector of the coefficients and the bias term at optimization iteration t. If $\left\| \frac{B_t - B_{t-1}}{B_t} \right\|_2 < \text{BetaTolerance}$, then optimization terminates.

If you also specify `DeltaGradientTolerance`, then optimization terminates when the software satisfies either stopping criterion.

If the software converges for the last solver specified in `Solver`, then optimization terminates. Otherwise, the software uses the next solver specified in `Solver`.

Example: `'BetaTolerance',1e-6`

Data Types: `single` | `double`

Gradient-difference tolerance between upper and lower pool Karush-Kuhn-Tucker (KKT) complementarity conditions violators, specified as the comma-separated pair consisting of `'DeltaGradientTolerance'` and a nonnegative scalar.

• If the magnitude of the KKT violators is less than `DeltaGradientTolerance`, then the software terminates optimization.

• If the software converges for the last solver specified in `Solver`, then optimization terminates. Otherwise, the software uses the next solver specified in `Solver`.

Example: `'DeltaGradientTolerance',1e-2`

Data Types: `double` | `single`

Number of passes through entire data set to process before next convergence check, specified as the comma-separated pair consisting of `'NumCheckConvergence'` and a positive integer.

Example: `'NumCheckConvergence',100`

Data Types: `single` | `double`

Maximal number of passes through the data, specified as the comma-separated pair consisting of `'PassLimit'` and a positive integer.

When the software completes one pass through the data, it has processed all observations.

When the software passes through the data `PassLimit` times, it terminates optimization.

Example: `'PassLimit',5`

Data Types: `single` | `double`

#### BFGS, LBFGS, and SpaRSA Convergence Controls

Relative tolerance on the linear coefficients and the bias term (intercept), specified as the comma-separated pair consisting of `'BetaTolerance'` and a nonnegative scalar.

Let $B_t = [\beta_t' \; b_t]$, that is, the vector of the coefficients and the bias term at optimization iteration t. If $\left\| \frac{B_t - B_{t-1}}{B_t} \right\|_2 < \text{BetaTolerance}$, then optimization terminates.

If you also specify `GradientTolerance`, then optimization terminates when the software satisfies either stopping criterion.

If the software converges for the last solver specified in `Solver`, then optimization terminates. Otherwise, the software uses the next solver specified in `Solver`.

Example: `'BetaTolerance',1e-6`

Data Types: `single` | `double`

Absolute gradient tolerance, specified as the comma-separated pair consisting of `'GradientTolerance'` and a nonnegative scalar.

Let $\nabla \mathcal{L}_t$ be the gradient vector of the objective function with respect to the coefficients and bias term at optimization iteration t. If $\left\| \nabla \mathcal{L}_t \right\|_\infty = \max \lvert \nabla \mathcal{L}_t \rvert < \text{GradientTolerance}$, then optimization terminates.

If you also specify `BetaTolerance`, then optimization terminates when the software satisfies either stopping criterion.

If the software converges for the last solver specified in `Solver`, then optimization terminates. Otherwise, the software uses the next solver specified in `Solver`.

Example: `'GradientTolerance',1e-5`

Data Types: `single` | `double`
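As an illustration of the test, this snippet applies the infinity norm to a hypothetical gradient vector:

```matlab
% Absolute gradient test: max|grad| < GradientTolerance.
gradL = [1e-6 -3e-5 2e-7];
stop  = norm(gradL,Inf) < 1e-5;   % false: max|gradL| = 3e-5
```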

Size of history buffer for Hessian approximation, specified as the comma-separated pair consisting of `'HessianHistorySize'` and a positive integer. That is, at each iteration, the software composes the Hessian using statistics from the latest `HessianHistorySize` iterations.

The software does not support `'HessianHistorySize'` for SpaRSA.

Example: `'HessianHistorySize',10`

Data Types: `single` | `double`

Maximal number of optimization iterations, specified as the comma-separated pair consisting of `'IterationLimit'` and a positive integer. `IterationLimit` applies to these values of `Solver`: `'bfgs'`, `'lbfgs'`, and `'sparsa'`.

Example: `'IterationLimit',500`

Data Types: `single` | `double`

## Output Arguments


Linear classification model learner template, returned as a template object. To train a linear classification model using high-dimensional data for multiclass problems, pass `t` to `fitcecoc`.

If you display `t` in the Command Window, then all unspecified options appear empty (`[]`). However, the software replaces empty options with their corresponding default values during training.

## More About

### Warm Start

A warm start is a set of initial estimates of the beta coefficients and bias term supplied to an optimization routine for quicker convergence.

## Tips

• It is a best practice to orient your predictor matrix so that observations correspond to columns and to specify `'ObservationsIn','columns'`. As a result, you can experience a significant reduction in optimization-execution time.

• If the predictor data has few observations, but many predictor variables, then:

• Specify `'PostFitBias',true`.

• For SGD or ASGD solvers, set `PassLimit` to a positive integer that is greater than 1, for example, 5 or 10. This setting often results in better accuracy.

• For SGD and ASGD solvers, `BatchSize` affects the rate of convergence.

• If `BatchSize` is too small, then the software achieves the minimum in many iterations, but computes the gradient per iteration quickly.

• If `BatchSize` is too large, then the software achieves the minimum in fewer iterations, but computes the gradient per iteration slowly.

• A large learning rate (see `LearnRate`) speeds up convergence to the minimum, but can lead to divergence (that is, over-stepping the minimum). A small learning rate ensures convergence to the minimum, but can lead to slow termination.

• If `Regularization` is `'lasso'`, then experiment with various values of `TruncationPeriod`. For example, set `TruncationPeriod` to `1`, `10`, and then `100`.

• For efficiency, the software does not standardize predictor data. To standardize the predictor data (`X`), enter

```matlab
% Center each predictor (row of X) and scale by its standard deviation.
X = bsxfun(@rdivide,bsxfun(@minus,X,mean(X,2)),std(X,0,2));
```

The code requires that you orient the predictors and observations as the rows and columns of `X`, respectively. Also, to economize on memory use, the code replaces the original predictor data with the standardized data.

## References

[1] Hsieh, C. J., K. W. Chang, C. J. Lin, S. S. Keerthi, and S. Sundararajan. “A Dual Coordinate Descent Method for Large-Scale Linear SVM.” Proceedings of the 25th International Conference on Machine Learning, ICML ’08, 2008, pp. 408–415.

[2] Langford, J., L. Li, and T. Zhang. “Sparse Online Learning Via Truncated Gradient.” J. Mach. Learn. Res., Vol. 10, 2009, pp. 777–801.

[3] Nocedal, J. and S. J. Wright. Numerical Optimization, 2nd ed., New York: Springer, 2006.

[4] Shalev-Shwartz, S., Y. Singer, and N. Srebro. “Pegasos: Primal Estimated Sub-Gradient Solver for SVM.” Proceedings of the 24th International Conference on Machine Learning, ICML ’07, 2007, pp. 807–814.

[5] Wright, S. J., R. D. Nowak, and M. A. T. Figueiredo. “Sparse Reconstruction by Separable Approximation.” Trans. Sig. Proc., Vol. 57, No. 7, 2009, pp. 2479–2493.

[6] Xiao, Lin. “Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization.” J. Mach. Learn. Res., Vol. 11, 2010, pp. 2543–2596.

[7] Xu, Wei. “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent.” CoRR, abs/1107.2490, 2011.

## Version History

Introduced in R2016a