Correct Spelling in Documents

This example shows how to correct spelling in documents using Hunspell.

Load Text Data

Create an array of tokenized documents.

str = [
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."
    "The phrase Higgs boson is a technical term."];
documents = tokenizedDocument(str)
documents = 
  3×1 tokenizedDocument:

    8 tokens: Correctly spelled worrds are important for lemmatization .
    8 tokens: Text Analytics Toolbox providesfunctions for spelling correction .
    9 tokens: The phrase Higgs boson is a technical term .

Correct Spelling

Correct the spelling of the documents using the correctSpelling function.

updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  3×1 tokenizedDocument:

    8 tokens: Correctly spelled words are important for solemnization .
    9 tokens: Text Analytics Toolbox provides functions for spelling correction .
    9 tokens: The phrase Riggs boson is a technical term .

Notice that:

  • The misspelled word "worrds" has been corrected to "words".

  • The correctly spelled word "lemmatization" has been changed to "solemnization".

  • The run-together word "providesfunctions" has been split into the two words "provides" and "functions".

  • The correctly spelled name "Higgs" has been changed to "Riggs".
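
In longer documents, changes like these are hard to find by reading the output. The following is a minimal sketch, not part of the original example, of one way to list the tokens that differ before and after correction. It assumes that string applied to a scalar tokenizedDocument returns its tokens as a string vector. The variable names before, after, removed, and added are illustrative only.

```matlab
% Sketch: report which tokens correctSpelling changed in each document.
str = [
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."
    "The phrase Higgs boson is a technical term."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents);

for i = 1:numel(documents)
    before = string(documents(i));         % tokens of document i before correction
    after = string(updatedDocuments(i));   % tokens of document i after correction

    % Set differences, keeping the order in which tokens appear.
    removed = setdiff(before, after, 'stable');  % tokens that no longer appear
    added = setdiff(after, before, 'stable');    % tokens introduced by the correction

    if ~isempty(removed) || ~isempty(added)
        fprintf("Document %d: [%s] -> [%s]\n", i, ...
            strjoin(removed, ", "), strjoin(added, ", "));
    end
end
```

Because the comparison uses set differences, it reports which tokens changed but not their positions. A word that is both removed in one place and kept in another would not be listed.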

Specify Custom Words

To prevent the software from updating particular words, you can provide a list of known words using the KnownWords name-value argument of the correctSpelling function.

Correct the spelling of the documents again and specify the words "lemmatization" and "Higgs" as known words.

updatedDocuments = correctSpelling(documents,'KnownWords',["lemmatization","Higgs"])
updatedDocuments = 
  3×1 tokenizedDocument:

    8 tokens: Correctly spelled words are important for lemmatization .
    9 tokens: Text Analytics Toolbox provides functions for spelling correction .
    9 tokens: The phrase Higgs boson is a technical term .

Notice here that the words "lemmatization" and "Higgs" remain unchanged.
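
When the list of known words grows, it can help to check programmatically that none of them was altered. The following is a minimal sketch under the assumption that the Vocabulary property of a tokenizedDocument array holds the unique words of the documents. The variable names knownWords, inInput, and inOutput are illustrative and do not come from the original example.

```matlab
% Sketch: confirm that known words survive spelling correction.
str = [
    "Correctly spelled worrds are important for lemmatization."
    "Text Analytics Toolbox providesfunctions for spelling correction."
    "The phrase Higgs boson is a technical term."];
documents = tokenizedDocument(str);

% Keep the protected terms in one variable so the list can be reused.
knownWords = ["lemmatization" "Higgs"];
updatedDocuments = correctSpelling(documents, 'KnownWords', knownWords);

% A known word that occurs in the input should still occur in the output.
inInput = ismember(knownWords, documents.Vocabulary);
inOutput = ismember(knownWords, updatedDocuments.Vocabulary);
assert(all(inOutput(inInput)), 'A known word was changed unexpectedly.')
```

The check is case-sensitive, so "Higgs" and "higgs" count as different words here.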
