Classify Out-of-Memory Text Data Using Custom Mini-Batch Datastore

This example shows how to classify out-of-memory text data with a deep learning network using a custom mini-batch datastore.

A mini-batch datastore is an implementation of a datastore with support for reading data in batches. You can use a mini-batch datastore as a source of training, validation, test, and prediction data sets for deep learning applications. Use mini-batch datastores to read out-of-memory data or to perform specific preprocessing operations when reading batches of data.

When training a network, the software creates mini-batches of sequences of the same length by padding, truncating, or splitting the input data. The trainingOptions function provides options to pad and truncate input sequences; however, these options are not well suited to sequences of word vectors. Furthermore, trainingOptions does not support padding data in a custom datastore, so you must pad and truncate the sequences manually. Left-padding and truncating the sequences of word vectors can improve training.

The Classify Text Data Using Deep Learning example manually truncates and pads all the documents to the same length. This process adds a large amount of padding to very short documents and discards a large amount of data from very long documents.

Alternatively, to avoid adding too much padding or discarding too much data, create a custom mini-batch datastore that inputs mini-batches into the network. The custom mini-batch datastore textDatastore.m converts mini-batches of documents to sequences of word vectors and left-pads each mini-batch to the length of the longest document in the mini-batch. For sorted data, this datastore can reduce the amount of padding added to the data, because documents are not padded to a fixed length. Similarly, the datastore does not discard any data from the documents.

This example uses the custom mini-batch datastore textDatastore.m. You can adapt this datastore to your data by customizing the functions. For an example showing how to create your own custom mini-batch datastore, see Develop Custom Mini-Batch Datastore (Deep Learning Toolbox).
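The left-padding step described above can be sketched as follows. This is a minimal illustration, not the shipped textDatastore.m code; it assumes each sequence is a numFeatures-by-T matrix of word vectors and pads each mini-batch with zeros on the left to the length of the longest sequence in the batch.

```matlab
% Minimal sketch of left-padding a mini-batch of word-vector sequences.
% Each cell contains a numFeatures-by-T matrix; T varies per document.
X = {rand(300,5) rand(300,12) rand(300,8)};   % hypothetical mini-batch

sequenceLengths = cellfun(@(s) size(s,2),X);
maxLength = max(sequenceLengths);

XPad = cell(size(X));
for i = 1:numel(X)
    padding = zeros(size(X{i},1),maxLength - sequenceLengths(i));
    XPad{i} = [padding X{i}];   % prepend zeros so sequences align on the right
end
```

Because padding only extends each sequence to the longest document in its own mini-batch, sorted data incurs far less padding than padding every document to a global fixed length.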

Load Pretrained Word Embedding

The datastore textDatastore requires a word embedding to convert documents to sequences of vectors. Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;
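As a quick check of the embedding, you can map individual words to their vectors with the word2vec function. The word "thunderstorm" here is only an illustrative input; any word in the embedding vocabulary works.

```matlab
% Map a word to its embedding vector. word2vec returns one row vector
% per input word, with emb.Dimension (here, 300) columns.
v = word2vec(emb,"thunderstorm");
size(v)
```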

Create Mini-Batch Datastore of Documents

Create a datastore that contains the data for training. The custom mini-batch datastore textDatastore reads predictors and labels from a CSV file. For the predictors, the datastore converts the documents into sequences of word vectors, and for the responses, the datastore returns a categorical label for each document.

To create the datastore, first save the custom mini-batch datastore textDatastore.m to the path. For more information about creating custom mini-batch datastores, see Develop Custom Mini-Batch Datastore (Deep Learning Toolbox).

For the training data, specify the CSV file "weatherReportsTrain.csv" and specify that the text and labels are in the columns "event_narrative" and "event_type", respectively.

filenameTrain = "weatherReportsTrain.csv";
textName = "event_narrative";
labelName = "event_type";
dsTrain = textDatastore(filenameTrain,textName,labelName,emb)
dsTrain = 
  textDatastore with properties:

            ClassNames: [1×39 string]
             Datastore: [1×1 matlab.io.datastore.TransformedDatastore]
    EmbeddingDimension: 300
             LabelName: "event_type"
         MiniBatchSize: 128
            NumClasses: 39
       NumObservations: 19683
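To inspect what the datastore returns, you can optionally read one mini-batch and then reset the datastore. This sketch assumes the datastore follows the standard read/reset interface described in Develop Custom Mini-Batch Datastore, where read returns a table with one row per observation containing the predictor sequences and the categorical responses.

```matlab
% Read one mini-batch of training data, then reset the datastore so that
% training starts from the first observation.
data = read(dsTrain);
head(data)       % inspect the first few observations
reset(dsTrain)
```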

Create a datastore containing the validation data from the CSV file "weatherReportsValidation.csv" using the same steps.

filenameValidation = "weatherReportsValidation.csv";
dsValidation = textDatastore(filenameValidation,textName,labelName,emb)
dsValidation = 
  textDatastore with properties:

            ClassNames: [1×39 string]
             Datastore: [1×1 matlab.io.datastore.TransformedDatastore]
    EmbeddingDimension: 300
             LabelName: "event_type"
         MiniBatchSize: 128
            NumClasses: 39
       NumObservations: 4218

Create and Train LSTM Network

Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to the embedding dimension. Next, include an LSTM layer with 180 hidden units. To use the LSTM layer for a sequence-to-label classification problem, set the output mode to 'last'. Finally, add a fully connected layer with output size equal to the number of classes, a softmax layer, and a classification layer.

numFeatures = dsTrain.EmbeddingDimension;
numHiddenUnits = 180;
numClasses = dsTrain.NumClasses;

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits,'OutputMode','last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

Specify the training options. Specify the solver to be 'adam' and the gradient threshold to be 2. The datastore textDatastore.m does not support shuffling, so set the 'Shuffle' option to 'never'. For an example showing how to implement a datastore with support for shuffling, see Develop Custom Mini-Batch Datastore (Deep Learning Toolbox). Validate the network once per epoch. To monitor the training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set 'Verbose' to false.

By default, trainNetwork uses a GPU if one is available (requires Parallel Computing Toolbox™ and a CUDA® enabled GPU with compute capability 3.0 or higher). Otherwise, it uses the CPU. To specify the execution environment manually, use the 'ExecutionEnvironment' name-value pair argument of trainingOptions. Training on a CPU can take significantly longer than training on a GPU.
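For example, to force training on the CPU, you could set the execution environment explicitly when creating the training options (a sketch; the options actually used in this example follow):

```matlab
% Sketch: force CPU training with the 'ExecutionEnvironment' name-value
% pair. Add this pair to any trainingOptions call.
optionsCPU = trainingOptions('adam','ExecutionEnvironment','cpu');
```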

miniBatchSize = 128;
numObservations = dsTrain.NumObservations;
numIterationsPerEpoch = floor(numObservations / miniBatchSize);

options = trainingOptions('adam', ...
    'MaxEpochs',15, ...
    'MiniBatchSize',miniBatchSize, ...
    'GradientThreshold',2, ...
    'Shuffle','never', ...
    'ValidationData',dsValidation, ...
    'ValidationFrequency',numIterationsPerEpoch, ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the LSTM network using the trainNetwork function.

net = trainNetwork(dsTrain,layers,options);

Test LSTM Network

Create a datastore containing the test documents and the labels.

filenameTest = "weatherReportsTest.csv";
dsTest = textDatastore(filenameTest,textName,labelName,emb)
dsTest = 
  textDatastore with properties:

            ClassNames: [1×39 string]
             Datastore: [1×1 matlab.io.datastore.TransformedDatastore]
    EmbeddingDimension: 300
             LabelName: "event_type"
         MiniBatchSize: 128
            NumClasses: 39
       NumObservations: 4217

Read the labels from the datastore using the readLabels function of the custom datastore.

YTest = readLabels(dsTest);

Classify the test documents using the trained LSTM network.

YPred = classify(net,dsTest);

Calculate the classification accuracy. The accuracy is the proportion of labels that the network predicts correctly.

accuracy = mean(YPred == YTest)
accuracy = 0.8084
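To see which classes the network confuses, you can optionally visualize the results with a confusion chart. This step is an addition beyond the accuracy computation above.

```matlab
% Optional: visualize per-class performance. Rows correspond to the true
% classes and columns to the predicted classes.
figure
confusionchart(YTest,YPred)
```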
