Generate Text with Deep Learning "Invalid training data. Labels must not contain undefined values" ERROR

I am using the MATLAB example "Generate Text with Deep Learning".
It works fine with the Shakespeare text provided in the example, but none of my own texts are accepted. I always get the error: "Invalid training data. Labels must not contain undefined values."
My text and code are provided below.
filename = 'RWE Nature.txt';
textData = fileread(filename);
textData = replace(textData," ","");
textData = split(textData,[newline]); % USE NEWLINE TO SPLIT TEXT INTO CELLS
% textData = textData(5:2:end);
textData(1:5) % 154 X 1 string array
startOfTextCharacter = compose("\x0002");
whitespaceCharacter = compose("\x00B7");
endOfTextCharacter = compose("\x2403");
newlineCharacter = compose("\x00B6");
textData = startOfTextCharacter + textData;
textData = replace(textData,[" " newline],[whitespaceCharacter newlineCharacter]);
uniqueCharacters = unique([textData{:}]); % '!'(),-.:;?ABCDEFGHIJKLMNOPRSTUVWYabcdefghijklmnopqrstuvwxyz¶·'
numUniqueCharacters = numel(uniqueCharacters); % 62
%
numDocuments = numel(textData); % 154 SONNETS, 89 PARAGRAPHS IN MAYER
XTrain = cell(1,numDocuments);
YTrain = cell(1,numDocuments);
for i = 1:numel(textData)
    characters = textData{i};
    sequenceLength = numel(characters);
    % Get indices of characters.
    [~,idx] = ismember(characters,uniqueCharacters);
    % Convert characters to vectors.
    X = zeros(numUniqueCharacters,sequenceLength);
    for j = 1:sequenceLength
        X(idx(j),j) = 1;
    end
    % Create vector of categorical responses with end of text character.
    charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter];
    Y = categorical(charactersShifted);
    XTrain{i} = X;
    YTrain{i} = Y;
end
% textData{1}
inputSize = size(XTrain{1},1);
numHiddenUnits = 200;
numClasses = numel(categories([YTrain{:}]));
layers = [
    sequenceInputLayer(inputSize)
    lstmLayer(numHiddenUnits,'OutputMode','sequence')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
options = trainingOptions('adam', ...
    'MaxEpochs',500, ...
    'InitialLearnRate',0.01, ...
    'GradientThreshold',2, ...
    'MiniBatchSize',77,...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false);
% Train the network.
'a'
net = trainNetwork(XTrain,YTrain,layers,options);
'b'
% Generate text using the trained network.
generatedText = generateText(net,uniqueCharacters,startOfTextCharacter,newlineCharacter,whitespaceCharacter,endOfTextCharacter)
'end'
function generatedText = generateText(net,uniqueCharacters,startOfTextCharacter,newlineCharacter,whitespaceCharacter,endOfTextCharacter)
    numUniqueCharacters = numel(uniqueCharacters);
    X = zeros(numUniqueCharacters,1);
    idx = strfind(uniqueCharacters,startOfTextCharacter);
    X(idx) = 1;
    generatedText = "";
    vocabulary = string(net.Layers(end).Classes);
    maxLength = 500;
    while strlength(generatedText) < maxLength
        % Predict the next character scores.
        [net,characterScores] = predictAndUpdateState(net,X,'ExecutionEnvironment','cpu');
        % Sample the next character.
        newCharacter = datasample(vocabulary,1,'Weights',characterScores);
        % Stop predicting at the end of text.
        if newCharacter == endOfTextCharacter
            break
        end
        % Add the character to the generated text.
        generatedText = generatedText + newCharacter;
        % Create a new vector for the next input.
        X(:) = 0;
        idx = strfind(uniqueCharacters,newCharacter);
        X(idx) = 1;
    end
    generatedText = replace(generatedText,[newlineCharacter whitespaceCharacter],[newline " "]);
end

Answers (1)

Ben on 28 Nov 2022
There are a few issues to fix:
  1. The call Y = categorical(charactersShifted) needs a valueset that covers every unique character in your dataset: Y = categorical(charactersShifted,allUniqueCharacters).
  2. For that to work with the uniqueCharacters variable, you need to convert it to the same class as charactersShifted, a string.
  3. The endOfTextCharacter must be in the valueset too; otherwise it becomes an <undefined> category in Y.
  4. Finally, the logic charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter]; might prepend an empty "" when characters is only 1 character long. That gives Y length 2 while X has length 1, so you'll get a sequence length mismatch when you try to train.
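A minimal sketch of how these four changes might fit together, not a tested drop-in for the script above. The toy textData is hypothetical and stands in for the real file, allUniqueCharacters is the illustrative name used in point 1, and the sketch assumes the end-of-text character never occurs in the text itself (the valueset must not contain duplicates).

```matlab
% Hypothetical toy input standing in for textData after the start-of-text
% character has been prepended; the second entry mimics a blank line.
startOfTextCharacter = compose("\x0002");
endOfTextCharacter = compose("\x2403");
textData = startOfTextCharacter + ["ab"; ""];

% Point 4: drop documents that hold only the start-of-text character.
textData = textData(strlength(textData) > 1);

% Points 2 and 3: vocabulary as a string array, one element per character,
% with the end-of-text character appended.
uniqueCharacters = unique([textData{:}]);
allUniqueCharacters = [string(num2cell(uniqueCharacters)) endOfTextCharacter];

numDocuments = numel(textData);
YTrain = cell(1,numDocuments);
for i = 1:numDocuments
    characters = textData{i};
    % Shift by one character and append the end-of-text character.
    charactersShifted = [string(num2cell(characters(2:end))) endOfTextCharacter];
    % Point 1: pass the full vocabulary as the valueset.
    YTrain{i} = categorical(charactersShifted,allUniqueCharacters);
end
```

Building charactersShifted directly as a string array avoids the cellstr call that can produce an empty element. With this valueset, categories([YTrain{:}]) lists every entry of allUniqueCharacters, including the start-of-text character, so numClasses counts it as well.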
I think training should work once you resolve these things. Hope that helps.
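Before calling trainNetwork, it may help to check which documents still trigger the problem. The sketch below uses hypothetical toy XTrain and YTrain cell arrays in place of the real ones and flags any document whose labels contain <undefined> or whose X and Y differ in sequence length.

```matlab
% Hypothetical toy data standing in for the real XTrain and YTrain.
% The empty string in the second label vector becomes <undefined>.
XTrain = {zeros(3,2), zeros(3,1)};
YTrain = {categorical(["a" "b"]), categorical(["" "b"])};

% Documents whose labels contain <undefined>.
hasUndefined = cellfun(@(y) any(isundefined(y)), YTrain);

% Documents where X and Y have a different number of time steps.
lengthMismatch = cellfun(@(x,y) size(x,2) ~= numel(y), XTrain, YTrain);

% Indices of documents that need attention.
badDocuments = find(hasUndefined | lengthMismatch)
```

Inspecting textData at the reported indices should show whether blank lines or unexpected characters are the cause.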

Release

R2021b
