Correct Spelling in Documents
This example shows how to correct spelling in documents using Hunspell.
Load Text Data
Create an array of tokenized documents.
str = [
"Correctly spelled worrds are important for lemmatization."
"Text Analytics Toolbox providesfunctions for spelling correction."
"The phrase Higgs boson is a technical term."];
documents = tokenizedDocument(str)
documents =
3×1 tokenizedDocument:
8 tokens: Correctly spelled worrds are important for lemmatization .
8 tokens: Text Analytics Toolbox providesfunctions for spelling correction .
9 tokens: The phrase Higgs boson is a technical term .
Correct Spelling
Correct the spelling of the documents using the correctSpelling function.
updatedDocuments = correctSpelling(documents)
updatedDocuments =
3×1 tokenizedDocument:
8 tokens: Correctly spelled words are important for solemnization .
9 tokens: Text Analytics Toolbox provides functions for spelling correction .
9 tokens: The phrase Riggs boson is a technical term .
Notice that:
The input word "worrds" has been changed to "words".
The input word "lemmatization" has been changed to "solemnization", which is not the intended word.
The input word "providesfunctions" has been split into the two words "provides" and "functions".
The input word "Higgs" has been changed to "Riggs", which is not the intended word.
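The changes listed above can also be found programmatically by comparing the tokens of each document before and after correction. The following is a minimal sketch, not part of the original example: it assumes the doc2cell function of Text Analytics Toolbox to extract the tokens, and the variable names origTokens, newTokens, removed, and added are illustrative. The exact corrections can vary with the Hunspell dictionary in use.

```matlab
% Recreate the example input so that this sketch runs on its own.
str = [
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."
    "The phrase Higgs boson is a technical term."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents);

% Convert each document to a string array of tokens
% (origTokens and newTokens are illustrative names).
origTokens = doc2cell(documents);
newTokens = doc2cell(updatedDocuments);

% For each document, display the tokens that appear only in the
% original (removed) and only in the corrected version (added).
for i = 1:numel(origTokens)
    disp("Document " + i)
    removed = setdiff(origTokens{i},newTokens{i},'stable')
    added = setdiff(newTokens{i},origTokens{i},'stable')
end
```

Reviewing the added tokens is one way to spot unwanted corrections, such as technical terms replaced by unrelated dictionary words.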
Specify Custom Words
To prevent the software from updating particular words, you can provide a list of known words using the KnownWords name-value argument of the correctSpelling function.
Correct the spelling of the documents again and specify the words "lemmatization" and "Higgs" as known words.
updatedDocuments = correctSpelling(documents,'KnownWords',["lemmatization","Higgs"])
updatedDocuments =
3×1 tokenizedDocument:
8 tokens: Correctly spelled words are important for lemmatization .
9 tokens: Text Analytics Toolbox provides functions for spelling correction .
9 tokens: The phrase Higgs boson is a technical term .
Notice here that the words "lemmatization" and "Higgs" remain unchanged.
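When the list of known words is reused across several calls, it can help to keep it in a variable and check that every entry survives correction. The following is a minimal sketch under that assumption: the variable names knownWords and isPreserved are illustrative, and the check relies on the Vocabulary property of tokenizedDocument.

```matlab
% Recreate the example input so that this sketch runs on its own.
str = [
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."
    "The phrase Higgs boson is a technical term."];
documents = tokenizedDocument(str);

% Keep the domain-specific terms in one place (illustrative name).
knownWords = ["lemmatization" "Higgs"];
updatedDocuments = correctSpelling(documents,'KnownWords',knownWords);

% Check that each known word still appears in the corrected documents.
isPreserved = ismember(knownWords,updatedDocuments.Vocabulary)
```

Each element of isPreserved corresponds to one entry of knownWords and is true when that word is still present after correction.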
See Also
correctSpelling | tokenizedDocument