
The first step of creating and training a new convolutional neural network (ConvNet) is to define the network architecture. This topic explains the details of ConvNet layers and the order in which they appear in a ConvNet. For a complete list of deep learning layers and how to create them, see List of Deep Learning Layers. To learn about LSTM networks for sequence classification and regression, see Long Short-Term Memory Networks. To learn how to create your own custom layers, see Define Custom Deep Learning Layers.

The network architecture can vary depending on the types and numbers of layers included. The types and numbers of layers included depend on the particular application or data. For example, if you have categorical responses, you must have a softmax layer and a classification layer, whereas if your response is continuous, you must have a regression layer at the end of the network. A smaller network with only one or two convolutional layers might be sufficient to learn from a small number of grayscale images. On the other hand, for more complex data with millions of color images, you might need a more complicated network with multiple convolutional and fully connected layers.

To specify the architecture of a deep network with all layers connected sequentially, create an array of layers directly. For example, to create a deep network which classifies 28-by-28 grayscale images into 10 classes, specify the layer array:

    layers = [
        imageInputLayer([28 28 1])
        convolution2dLayer(3,16,'Padding',1)
        batchNormalizationLayer
        reluLayer
        maxPooling2dLayer(2,'Stride',2)
        convolution2dLayer(3,32,'Padding',1)
        batchNormalizationLayer
        reluLayer
        fullyConnectedLayer(10)
        softmaxLayer
        classificationLayer];

`layers` is an array of `Layer` objects. You can then use `layers` as an input to the training function `trainNetwork`.

To specify the architecture of a neural network with all layers connected sequentially, create an array of layers directly. To specify the architecture of a network where layers can have multiple inputs or outputs, use a `LayerGraph` object.
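As a rough sketch of the second case, the following builds a small network with one skip connection. It assumes the `layerGraph`, `additionLayer`, and `connectLayers` functions from the same toolbox, which this topic does not otherwise describe. The layer names and sizes are arbitrary choices for illustration.

```matlab
% Sequential part of the network. Layers need names so that they can be
% connected later; the names here are illustrative only.
layers = [
    imageInputLayer([28 28 1],'Name','input')
    convolution2dLayer(3,16,'Padding',1,'Name','conv_1')
    reluLayer('Name','relu_1')
    convolution2dLayer(3,16,'Padding',1,'Name','conv_2')
    additionLayer(2,'Name','add')          % layer with two inputs
    fullyConnectedLayer(10,'Name','fc')
    softmaxLayer('Name','softmax')
    classificationLayer('Name','output')];

% Convert the array to a layer graph; layers are connected in order.
lgraph = layerGraph(layers);

% Add the skip connection from relu_1 to the second input of the addition layer.
lgraph = connectLayers(lgraph,'relu_1','add/in2');
```

Both inputs of the addition layer are 28-by-28-by-16 here, because the padded 3-by-3 convolutions preserve the spatial size.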

Create an image input layer using `imageInputLayer`.

An image input layer inputs images to a network and applies data normalization.

Specify the image size using the `inputSize` argument. The size of an image corresponds to the height, width, and the number of color channels of that image. For example, for a grayscale image, the number of channels is 1, and for a color image it is 3.
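For instance, the two cases might be written as follows. The 28-by-28 size is only an example, and reading back the `InputSize` property is an assumption about how the layer object exposes its size.

```matlab
% Grayscale 28-by-28 images: one channel
grayInput = imageInputLayer([28 28 1]);

% Color 28-by-28 images: three channels
colorInput = imageInputLayer([28 28 3]);

% Inspect the stored sizes (height, width, channels)
disp(grayInput.InputSize)
disp(colorInput.InputSize)
```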

A 2-D convolutional layer applies sliding convolutional filters to the input. Create a 2-D convolutional layer using `convolution2dLayer`.

The convolutional layer consists of various components.^{[1]}

A convolutional layer consists of neurons that connect to subregions of the input images or the outputs of the previous layer. The layer learns the features localized by these regions while scanning through an image. When creating a layer using the `convolution2dLayer` function, you can specify the size of these regions using the `filterSize` input argument.

For each region, the `trainNetwork` function computes a dot product of the weights and the input, and then adds a bias term. A set of weights that is applied to a region in the image is called a *filter*. The filter moves along the input image vertically and horizontally, repeating the same computation for each region. In other words, the filter convolves the input.

This image shows a 3-by-3 filter scanning through the input. The lower map represents the input and the upper map represents the output.

The step size with which the filter moves is called a *stride*. You can specify the step size with the `'Stride'` name-value pair argument. The local regions that the neurons connect to can overlap depending on the `filterSize` and `'Stride'` values.

This image shows a 3-by-3 filter scanning through the input with a stride of 2. The lower map represents the input and the upper map represents the output.

The number of weights in a filter is *h* * *w* * *c*, where *h* is the height and *w* is the width of the filter, and *c* is the number of channels in the input. For example, if the input is a color image, the number of color channels is 3. The number of filters determines the number of channels in the output of a convolutional layer. Specify the number of filters using the `numFilters` argument with the `convolution2dLayer` function.

A dilated convolution is a convolution in which the filters are expanded by spaces inserted between the elements of the filter. Specify the dilation factor using the `'DilationFactor'` property.

Use dilated convolutions to increase the receptive field (the area of the input which the layer can see) of the layer without increasing the number of parameters or computation.

The layer expands the filters by inserting zeros between each filter element. The dilation factor determines the step size for sampling the input or equivalently the upsampling factor of the filter. It corresponds to an effective filter size of (*Filter Size* – 1) .* *Dilation Factor* + 1. For example, a 3-by-3 filter with the dilation factor `[2 2]` is equivalent to a 5-by-5 filter with zeros between the elements.
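The effective filter size in the example above can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names and the choice of 16 filters are illustrative assumptions.

```matlab
filterSize = [3 3];
dilationFactor = [2 2];

% Effective filter size: (Filter Size - 1) .* Dilation Factor + 1
effectiveFilterSize = (filterSize - 1) .* dilationFactor + 1;   % [5 5]

% A layer with this dilation still has only 3*3 weights per input channel.
% The number of filters (16) is an arbitrary example value.
layer = convolution2dLayer(filterSize,16,'DilationFactor',dilationFactor);
```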

This image shows a 3-by-3 filter dilated by a factor of two scanning through the input. The lower map represents the input and the upper map represents the output.

As a filter moves along the input, it uses the same set of weights and the same bias for the convolution, forming a *feature map*. Each feature map is the result of a convolution using a different set of weights and a different bias. Hence, the number of feature maps is equal to the number of filters. The total number of parameters in a convolutional layer is ((*h* * *w* * *c* + 1) * *Number of Filters*), where 1 is the bias.

You can also apply zero padding to input image borders vertically and horizontally using the `'Padding'` name-value pair argument. Padding is rows or columns of zeros added to the borders of an image input. By adjusting the padding, you can control the output size of the layer.

This image shows a 3-by-3 filter scanning through the input with padding of size 1. The lower map represents the input and the upper map represents the output.

The output height and width of a convolutional layer is (*Input Size* – ((*Filter Size* – 1) * *Dilation Factor* + 1) + 2 * *Padding*)/*Stride* + 1. This value must be an integer for the whole image to be fully covered. If the combination of these options does not lead the image to be fully covered, the software by default ignores the remaining part of the image along the right and bottom edges in the convolution.

The product of the output height and width gives the total number of neurons in a feature map, say *Map Size*. The total number of neurons (output size) in a convolutional layer is *Map Size* * *Number of Filters*.

For example, suppose that the input image is a 32-by-32-by-3 color image. For a convolutional layer with eight filters and a filter size of 5-by-5, the number of weights per filter is 5 * 5 * 3 = 75, and the total number of parameters in the layer is (75 + 1) * 8 = 608. If the stride is 2 in each direction and padding of size 2 is specified, then each feature map is 16-by-16. This is because (32 – 5 + 2 * 2)/2 + 1 = 16.5, and some of the outermost zero padding to the right and bottom of the image is discarded. Finally, the total number of neurons in the layer is 16 * 16 * 8 = 2048.
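The arithmetic in this example can be reproduced as a short script. This is only a sketch of the formulas given above; the variable names are illustrative, and `floor` stands in for the software ignoring the part of the image that is not fully covered.

```matlab
inputSize = [32 32 3];      % height, width, channels
filterSize = [5 5];
numFilters = 8;
stride = [2 2];
padding = [2 2];
dilationFactor = [1 1];     % no dilation

% Weights per filter: h * w * c
weightsPerFilter = prod(filterSize) * inputSize(3);          % 75

% Total parameters: (h * w * c + 1) * Number of Filters
numParameters = (weightsPerFilter + 1) * numFilters;         % 608

% Output size before rounding
effectiveFilterSize = (filterSize - 1) .* dilationFactor + 1;
rawSize = (inputSize(1:2) - effectiveFilterSize + 2*padding) ./ stride + 1;  % [16.5 16.5]

% The part of the padded image that is not fully covered is ignored
mapSize = floor(rawSize);                                    % [16 16]

% Total number of neurons: Map Size * Number of Filters
numNeurons = prod(mapSize) * numFilters;                     % 2048
```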

Usually, the results from these neurons pass through some form of nonlinearity, such as rectified linear units (ReLU).

You can adjust the learning rates and regularization options for the layer using name-value pair arguments while defining the convolutional layer. If you choose not to specify these options, then `trainNetwork` uses the global training options defined with the `trainingOptions` function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.

A convolutional neural network can consist of one or multiple convolutional layers. The number of convolutional layers depends on the amount and complexity of the data.

Create a batch normalization layer using `batchNormalizationLayer`.

A batch normalization layer normalizes each input channel across a mini-batch. To speed up training of convolutional neural networks and reduce the sensitivity to network initialization, use batch normalization layers between convolutional layers and nonlinearities, such as ReLU layers.

The layer first normalizes the activations of each channel by subtracting the mini-batch mean
and dividing by the mini-batch standard deviation. Then, the layer shifts the input by a
learnable offset *β* and scales it by a learnable scale factor
*γ*. *β* and *γ* are themselves
learnable parameters that are updated during network training.
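The two steps can be sketched for a single channel as follows. The small constant `epsilon` added to the variance is an assumption made here for numerical stability and is not mentioned above; the values of `gamma` and `beta` are arbitrary examples rather than learned values.

```matlab
x = [1 2 3 4];      % activations of one channel across a mini-batch
epsilon = 1e-5;     % assumed stabilizing constant
gamma = 2;          % example scale factor
beta = 0.5;         % example offset

% Step 1: normalize by the mini-batch mean and standard deviation
mu = mean(x);
sigma2 = var(x,1);                          % variance normalized by N
xHat = (x - mu) ./ sqrt(sigma2 + epsilon);

% Step 2: scale and shift
y = gamma * xHat + beta;
```

After the first step, `xHat` has zero mean and approximately unit variance, so the mean of `y` equals `beta`.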

Batch normalization layers normalize the activations and gradients propagating through a
neural network, making network training an easier optimization problem. To take full
advantage of this fact, you can try increasing the learning rate. Since the optimization
problem is easier, the parameter updates can be larger and the network can learn faster. You
can also try reducing the $$L_2$$ and dropout regularization. With batch normalization layers, the activations of a specific image during training depend on which images happen to appear in the same mini-batch. To take full advantage of this regularizing effect, try shuffling the training data before every training epoch. To specify how often to shuffle the data during training, use the `'Shuffle'` name-value pair argument of `trainingOptions`.

Create a ReLU layer using `reluLayer`.

A ReLU layer performs a threshold operation on each element of the input, where any value less than zero is set to zero.

Convolutional and batch normalization layers are usually followed by a nonlinear activation function such as a rectified linear unit (ReLU), specified by a ReLU layer. For each input element $$x$$, the layer computes

$$f(x)=\begin{cases}x, & x\ge 0\\ 0, & x<0\end{cases}.$$

The ReLU layer does not change the size of its input.
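A minimal sketch of the same threshold operation on a plain numeric array, outside of any network (the input values are arbitrary):

```matlab
x = [-2 -0.5 0 1.5; 3 -1 0.2 -4];

% Element-wise threshold: negative values become zero
y = max(x,0);

% The output has the same size as the input
disp(size(y))
```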

There are other nonlinear activation layers that perform different operations and can improve the network accuracy for some applications. For a list of activation layers, see Activation Layers.

Create a cross-channel normalization layer using `crossChannelNormalizationLayer`.

A channel-wise local response (cross-channel) normalization layer carries out channel-wise normalization.

This layer performs a channel-wise local response normalization. It usually follows the ReLU activation layer. This layer replaces each element with a normalized value it obtains using the elements from a certain number of neighboring channels (elements in the normalization window). That is, for each element $$x$$ in the input, `trainNetwork` computes a normalized value $$x'$$ using

$$x'=\frac{x}{{\left(K+\frac{\alpha \cdot ss}{windowChannelSize}\right)}^{\beta}},$$

where *K*, *α*, and *β* are the hyperparameters in the normalization, and *ss* is the sum of squares of the elements in the normalization window [2]. You must specify the size of the normalization window using the `windowChannelSize` argument of the `crossChannelNormalizationLayer` function. You can also specify the hyperparameters using the `Alpha`, `Beta`, and `K` name-value pair arguments.

The previous normalization formula is slightly different from the one presented in [2]. You can obtain the equivalent formula by multiplying the `alpha` value by the `windowChannelSize`.
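To make the formula concrete, here is a sketch that normalizes one element by hand. The hyperparameter values are examples only, and centering the window on the channel being normalized is an assumption about how the window is positioned.

```matlab
channels = [1 2 3 4 5];     % values at one spatial location across 5 channels
windowChannelSize = 3;
K = 2;                      % example hyperparameter values
alpha = 1e-4;
beta = 0.75;

c = 3;                                  % index of the channel to normalize
half = floor(windowChannelSize/2);      % assumes a centered window
first = max(1,c - half);
last = min(numel(channels),c + half);
window = channels(first:last);          % [2 3 4]

ss = sum(window.^2);                    % sum of squares in the window: 29
xNorm = channels(c) / (K + alpha*ss/windowChannelSize)^beta;
```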

A max pooling layer performs down-sampling by dividing the input into rectangular pooling regions, and computing the maximum of each region. Create a max pooling layer using `maxPooling2dLayer`.

An average pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the average values of each region. Create an average pooling layer using `averagePooling2dLayer`.

Pooling layers follow the convolutional layers for down-sampling, which reduces the number of connections to the following layers. They do not perform any learning themselves, but reduce the number of parameters to be learned in the following layers. They also help reduce overfitting.

A max pooling layer returns the maximum values of rectangular regions of its input. The size of the rectangular regions is determined by the `poolSize` argument of `maxPooling2dLayer`. For example, if `poolSize` equals `[2,3]`, then the layer returns the maximum value in regions of height 2 and width 3.

An average pooling layer outputs the average values of rectangular regions of its input. The size of the rectangular regions is determined by the `poolSize` argument of `averagePooling2dLayer`. For example, if `poolSize` is `[2,3]`, then the layer returns the average value of regions of height 2 and width 3.

Pooling layers scan through the input horizontally and vertically in step sizes you can specify using the `'Stride'` name-value pair argument. If the pool size is smaller than or equal to the stride, then the pooling regions do not overlap.

For nonoverlapping regions (*Pool Size* and *Stride* are equal), if the input to the pooling layer is *n*-by-*n*, and the pooling region size is *h*-by-*h*, then the pooling layer down-samples the regions by *h* [6]. That is, the output of a max or average pooling layer for one channel of a convolutional layer is *n*/*h*-by-*n*/*h*. For overlapping regions, the output of a pooling layer is (*Input Size* – *Pool Size* + 2 * *Padding*)/*Stride* + 1.
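The nonoverlapping case can be illustrated on a single 4-by-4 channel with plain loops. This sketch shows the arithmetic only, not how the pooling layers are implemented; the input matrix and variable names are arbitrary.

```matlab
A = magic(4);       % one 4-by-4 input channel
poolSize = 2;
stride = 2;         % equal to the pool size, so regions do not overlap
padding = 0;

% Output size: (Input Size - Pool Size + 2*Padding)/Stride + 1
outSize = (size(A,1) - poolSize + 2*padding)/stride + 1;   % 2, that is n/h

maxOut = zeros(outSize);
avgOut = zeros(outSize);
for r = 1:outSize
    for c = 1:outSize
        rows = (r-1)*stride + (1:poolSize);
        cols = (c-1)*stride + (1:poolSize);
        region = A(rows,cols);
        maxOut(r,c) = max(region(:));    % max pooling
        avgOut(r,c) = mean(region(:));   % average pooling
    end
end
```

For this input, `maxOut` is `[16 13; 14 15]` and every entry of `avgOut` is 8.5.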

Create a dropout layer using `dropoutLayer`.

A dropout layer randomly sets input elements to zero with a given probability.

At training time, the layer randomly sets input elements to zero using the dropout mask `rand(size(X))<Probability`, where `X` is the layer input, and then scales the remaining elements by `1/(1-Probability)`. This operation effectively changes the underlying network architecture between iterations and helps prevent the network from overfitting [7], [2]. A higher probability results in more elements being dropped during training. At prediction time, the output of the layer is equal to its input.
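The training-time behavior described above can be sketched on a plain array. The fixed random seed and the all-ones input are assumptions made so that the effect is easy to see.

```matlab
rng(0);                 % fix the seed so the sketch is repeatable
X = ones(4,5);          % example layer input
Probability = 0.5;

% Dropout mask: true marks the elements to set to zero
mask = rand(size(X)) < Probability;

Y = X;
Y(mask) = 0;                    % drop the selected elements
Y = Y ./ (1 - Probability);     % scale the remaining elements
```

With `Probability` set to 0.5, every surviving element of `Y` equals 2, which keeps the expected value of each element equal to its input value.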

Similar to max or average pooling layers, no learning takes place in this layer.

Create a fully connected layer using `fullyConnectedLayer`.

A fully connected layer multiplies the input by a weight matrix and then adds a bias vector.

The convolutional (and down-sampling) layers are followed by one or more fully connected layers.

As the name suggests, all neurons in a fully connected layer connect to all the neurons in the previous layer. This layer combines all of the features (local information) learned by the previous layers across the image to identify the larger patterns. For classification problems, the last fully connected layer combines the features to classify the images. This is the reason that the `outputSize` argument of the last fully connected layer of the network is equal to the number of classes of the data set. For regression problems, the output size must be equal to the number of response variables.

You can also adjust the learning rate and the regularization parameters for this layer using the related name-value pair arguments when creating the fully connected layer. If you choose not to adjust them, then `trainNetwork` uses the global training parameters defined by the `trainingOptions` function. For details on global and layer training options, see Set Up Parameters and Train Convolutional Neural Network.

A fully connected layer multiplies the input by a weight matrix *W* and then adds a bias vector *b*.

If the input to the layer is a sequence (for example, in an LSTM network), then the fully connected layer acts independently on each time step. For example, if the layer before the fully connected layer outputs an array *X* of size *D*-by-*N*-by-*S*, then the fully connected layer outputs an array *Z* of size `outputSize`-by-*N*-by-*S*. At time step *t*, the corresponding entry of *Z* is $$W{X}_{t}+b$$, where $${X}_{t}$$ denotes time step *t* of *X*.
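The per-time-step computation can be sketched with random data. The sizes and values here are arbitrary, and adding `b` to a matrix relies on implicit expansion, which is assumed to be available (MATLAB R2016b or later).

```matlab
D = 4;              % input feature dimension
N = 2;              % number of observations
S = 3;              % sequence length
outputSize = 5;

rng(0);
X = rand(D,N,S);            % output of the previous layer
W = rand(outputSize,D);     % weight matrix
b = rand(outputSize,1);     % bias vector

Z = zeros(outputSize,N,S);
for t = 1:S
    % Same W and b at every time step; b expands across the N columns
    Z(:,:,t) = W * X(:,:,t) + b;
end
```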

A softmax layer applies a softmax function to the input. Create a softmax layer using `softmaxLayer`.

A classification layer computes the cross entropy loss for multi-class classification problems with mutually exclusive classes. Create a classification layer using `classificationLayer`.

For classification problems, a softmax layer and then a classification layer must follow the final fully connected layer.

The output unit activation function is the softmax function:

$${y}_{r}\left(x\right)=\frac{\mathrm{exp}\left({a}_{r}\left(x\right)\right)}{{\displaystyle \sum _{j=1}^{k}\mathrm{exp}\left({a}_{j}\left(x\right)\right)}},$$

where $$0\le {y}_{r}\le 1$$ and $$\sum _{j=1}^{k}{y}_{j}=1$$.

The softmax function is the output unit activation function after the last fully connected layer for multi-class classification problems:

$$P\left({c}_{r}|x,\theta \right)=\frac{P\left(x,\theta |{c}_{r}\right)P\left({c}_{r}\right)}{{\displaystyle \sum _{j=1}^{k}P\left(x,\theta |{c}_{j}\right)P\left({c}_{j}\right)}}=\frac{\mathrm{exp}\left({a}_{r}\left(x,\theta \right)\right)}{{\displaystyle \sum _{j=1}^{k}\mathrm{exp}\left({a}_{j}\left(x,\theta \right)\right)}},$$

where $$0\le P\left({c}_{r}|x,\theta \right)\le 1$$ and $$\sum _{j=1}^{k}P\left({c}_{j}|x,\theta \right)=1$$. Moreover, $${a}_{r}=\mathrm{ln}\left(P\left(x,\theta |{c}_{r}\right)P\left({c}_{r}\right)\right)$$, $$P\left(x,\theta |{c}_{r}\right)$$ is the conditional probability of the sample given class *r*, and $$P\left({c}_{r}\right)$$ is the class prior probability.

The softmax function is also known as the *normalized exponential* and can be considered the multi-class generalization of the logistic sigmoid function [8].
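A small numeric sketch of the softmax function for one observation with three classes follows. Subtracting the maximum activation before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result.

```matlab
a = [1 2 3];                % activations from the last fully connected layer

expA = exp(a - max(a));     % shift for numerical stability
y = expA ./ sum(expA);      % softmax outputs

% Each output lies in [0,1] and the outputs sum to 1
disp(sum(y))
```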

For typical classification networks, the classification layer must follow the softmax layer. In the classification layer, `trainNetwork` takes the values from the softmax function and assigns each input to one of the *K* mutually exclusive classes using the cross entropy function for a 1-of-*K* coding scheme [8]:

$$\text{loss}=-\sum_{i=1}^{N}\sum_{j=1}^{K}{t}_{ij}\ln {y}_{ij},$$

where *N* is the number of samples, *K*
is the number of classes, $${t}_{ij}$$ is the indicator that the *i*th sample belongs to the
*j*th class, and $${y}_{ij}$$ is the output for sample *i* for class
*j*, which in this case, is the value from the softmax function. That
is, it is the probability that the network associates the *i*th input with
class *j*.
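As a sketch, the cross entropy loss for two samples and three classes can be computed directly from this definition. The targets and softmax outputs are made-up example values.

```matlab
% One-hot targets: sample 1 belongs to class 1, sample 2 to class 2
T = [1 0 0;
     0 1 0];

% Softmax outputs for the same two samples (each row sums to 1)
Y = [0.7 0.2 0.1;
     0.1 0.8 0.1];

% Only the entries where T is 1 contribute to the sum
loss = -sum(sum(T .* log(Y)));      % -(log(0.7) + log(0.8))
```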

Create a regression layer using `regressionLayer`.

A regression layer computes the half-mean-squared-error loss for regression problems. For typical regression problems, a regression layer must follow the final fully connected layer.

For a single observation, the mean-squared-error is given by:

$$\text{MSE}={\displaystyle \sum}_{i=1}^{R}\frac{{({t}_{i}-{y}_{i})}^{2}}{R},$$

where *R* is the number of responses, $${t}_{i}$$ is the target output, and $${y}_{i}$$ is the network's prediction for response *i*.

For image and sequence-to-one regression networks, the loss function of the regression
layer is the half-mean-squared-error of the predicted responses, not normalized by
*R*:

$$\text{loss}=\frac{1}{2}{\displaystyle \sum}_{i=1}^{R}{({t}_{i}-{y}_{i})}^{2}.$$
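The difference between the two quantities can be seen on a single observation with three responses. The target and prediction values below are made up for illustration.

```matlab
t = [1 2 3];        % target outputs
y = [1.5 2 2];      % predicted responses
R = numel(t);       % number of responses

squaredError = sum((t - y).^2);     % 0.25 + 0 + 1 = 1.25

mse = squaredError / R;             % mean-squared-error, normalized by R
loss = 0.5 * squaredError;          % half-mean-squared-error loss: 0.625
```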

For image-to-image regression networks, the loss function of the regression layer is the
half-mean-squared-error of the predicted responses for each pixel, not normalized by
*R*:

$$\text{loss}=\frac{1}{2}{\displaystyle \sum}_{p=1}^{HWC}{({t}_{p}-{y}_{p})}^{2},$$

where *H*, *W*, and
*C* denote the height, width, and number of channels of the output
respectively, and *p* indexes into each element (pixel) of
*t* and *y* linearly.

For sequence-to-sequence regression networks, the loss function of the regression layer is
the half-mean-squared-error of the predicted responses for each time step, not normalized by
*R*:

$$\text{loss}=\frac{1}{2S}{\displaystyle \sum}_{i=1}^{S}{\displaystyle \sum}_{j=1}^{R}{({t}_{ij}-{y}_{ij})}^{2},$$

where *S* is the sequence length.

When training, the software calculates the mean loss over the observations in the mini-batch.

[1] Murphy, K. P. *Machine Learning: A Probabilistic
Perspective*. Cambridge, Massachusetts: The MIT Press,
2012.

[2] Krizhevsky, A., I. Sutskever, and G. E. Hinton. "ImageNet
Classification with Deep Convolutional Neural Networks." *Advances in Neural
Information Processing Systems*. Vol 25, 2012.

[3] LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, et al. "Handwritten Digit Recognition with a Back-propagation Network." In *Advances in Neural Information Processing Systems*, 1990.

[4] LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based Learning Applied to Document Recognition." *Proceedings of the IEEE*. Vol 86, pp. 2278–2324, 1998.

[5] Nair, V. and G. E. Hinton. "Rectified linear units improve restricted boltzmann machines." In Proc. 27th International Conference on Machine Learning, 2010.

[6] Nagi, J., F. Ducatelle, G. A. Di Caro, D. Ciresan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, L. M. Gambardella. "Max-Pooling Convolutional Neural Networks for Vision-based Hand Gesture Recognition." *IEEE International Conference on Signal and Image Processing Applications (ICSIPA2011)*, 2011.

[7] Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, R.
Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting."
*Journal of Machine Learning Research*. Vol. 15, pp. 1929-1958,
2014.

[8] Bishop, C. M. *Pattern Recognition and Machine
Learning*. Springer, New York, NY, 2006.

[9] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep
network training by reducing internal covariate shift." *preprint, arXiv:1502.03167* (2015).

`averagePooling2dLayer` | `batchNormalizationLayer` | `classificationLayer` | `clippedReluLayer` | `convolution2dLayer` | `crossChannelNormalizationLayer` | `dropoutLayer` | `fullyConnectedLayer` | `imageInputLayer` | `leakyReluLayer` | `maxPooling2dLayer` | `regressionLayer` | `reluLayer` | `softmaxLayer` | `trainingOptions` | `trainNetwork`

^{[1]} Image credit: Convolution arithmetic (License)