Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.

Text data can be large and can contain substantial noise that negatively affects statistical analysis. For example, text data can contain the following:

  • Variations in case, for example "new" and "New"

  • Variations in word forms, for example "walk" and "walking"

  • Words which add noise, for example stop words such as "the" and "of"

  • Punctuation and special characters

  • HTML and XML tags
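Each source of noise above can be addressed with a dedicated toolbox function, most of which appear later in this example. The following is a minimal sketch on a hypothetical input string (the string and variable names here are illustrative, not part of the weather reports data set):

```matlab
% Hypothetical raw text containing tags, punctuation, case and
% word-form variations, and stop words.
str = "<b>A tree is Downed</b> outside Apple Hill Drive, Natick.";

str = eraseTags(str);               % remove HTML and XML tags
document = tokenizedDocument(str);  % split the text into tokens

% Lemmatization normalizes case and word forms ("Downed" -> "down").
document = addPartOfSpeechDetails(document);
document = normalizeWords(document,'Style','lemma');

document = removeStopWords(document);   % remove noise words such as "a" and "is"
document = erasePunctuation(document)   % remove punctuation
```

This is the same sequence of steps that the rest of this example applies to the weather reports, with the addition of eraseTags for tagged input.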

These word clouds illustrate word frequency analysis applied to some raw text data from weather reports, and a preprocessed version of the same text data.

Load and Extract Text Data

Load the example data. The file weatherReports.csv contains weather reports, including a text description and categorical labels for each event.

filename = "weatherReports.csv";
data = readtable(filename,'TextType','string');

Extract the text data from the field event_narrative, and the label data from the field event_type.

textData = data.event_narrative;
labels = data.event_type;
textData(1:10)
ans = 10×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    ""
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
    "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
    "Quarter size hail near Rosemark."
    "Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
    "Powerlines down at Walnut Grove and Cherry Lane roads."

Create Tokenized Documents

Create an array of tokenized documents.

cleanedDocuments = tokenizedDocument(textData);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     8 tokens: Large tree down between Plantersville and Nettleton .
    39 tokens: One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour . One vehicle was stalled in the water .
    14 tokens: NWS Columbia relayed a report of trees blown down along Tom Hall St .
    14 tokens: Media reported two trees blown down along I-40 in the Old Fort area .
     0 tokens:
    15 tokens: A few tree limbs greater than 6 inches down on HWY 18 in Roseland .
    20 tokens: Awning blown off a building on Lamar Avenue . Multiple trees down near the intersection of Winchester and Perkins .
     6 tokens: Quarter size hail near Rosemark .
    21 tokens: Tin roof ripped off house on Old Memphis Road near Billings Drive . Several large trees down in the area .
    10 tokens: Powerlines down at Walnut Grove and Cherry Lane roads .

Lemmatize the words using normalizeWords. To improve lemmatization, first add part of speech details to the documents using addPartOfSpeechDetails.

cleanedDocuments = addPartOfSpeechDetails(cleanedDocuments);
cleanedDocuments = normalizeWords(cleanedDocuments,'Style','lemma');
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     8 tokens: large tree down between plantersville and nettleton .
    39 tokens: one to two foot of deep standing water develop on a street on the winthrop university campus after more than an inch of rain fall in less than an hour . one vehicle be stall in the water .
    14 tokens: nws columbia relay a report of tree blow down along tom hall st .
    14 tokens: medium report two tree blow down along i-40 in the old fort area .
     0 tokens:
    15 tokens: a few tree limb great than 6 inch down on hwy 18 in roseland .
    20 tokens: awning blow off a building on lamar avenue . multiple tree down near the intersection of winchester and perkins .
     6 tokens: quarter size hail near rosemark .
    21 tokens: tin roof rip off house on old memphis road near billings drive . several large tree down in the area .
    10 tokens: powerlines down at walnut grove and cherry lane road .

Erase the punctuation from the documents.

cleanedDocuments = erasePunctuation(cleanedDocuments);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     7 tokens: large tree down between plantersville and nettleton
    37 tokens: one to two foot of deep standing water develop on a street on the winthrop university campus after more than an inch of rain fall in less than an hour one vehicle be stall in the water
    13 tokens: nws columbia relay a report of tree blow down along tom hall st
    13 tokens: medium report two tree blow down along i40 in the old fort area
     0 tokens:
    14 tokens: a few tree limb great than 6 inch down on hwy 18 in roseland
    18 tokens: awning blow off a building on lamar avenue multiple tree down near the intersection of winchester and perkins
     5 tokens: quarter size hail near rosemark
    19 tokens: tin roof rip off house on old memphis road near billings drive several large tree down in the area
     9 tokens: powerlines down at walnut grove and cherry lane road

Words like "a", "and", "to", and "the" (known as stop words) can add noise to data. Remove a list of stop words using the removeStopWords function.

cleanedDocuments = removeStopWords(cleanedDocuments);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
    10 tokens: nws columbia relay report tree blow down tom hall st
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
    10 tokens: few tree limb great 6 inch down hwy 18 roseland
    13 tokens: awning blow off building lamar avenue multiple tree down near intersection winchester perkins
     5 tokens: quarter size hail near rosemark
    16 tokens: tin roof rip off house old memphis road near billings drive several large tree down area
     7 tokens: powerlines down walnut grove cherry lane road

Remove words with 2 or fewer characters, and words with 15 or more characters.

cleanedDocuments = removeShortWords(cleanedDocuments,2);
cleanedDocuments = removeLongWords(cleanedDocuments,15);
cleanedDocuments(1:10)
ans = 
  10×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
     9 tokens: nws columbia relay report tree blow down tom hall
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
     8 tokens: few tree limb great inch down hwy roseland
    13 tokens: awning blow off building lamar avenue multiple tree down near intersection winchester perkins
     5 tokens: quarter size hail near rosemark
    16 tokens: tin roof rip off house old memphis road near billings drive several large tree down area
     7 tokens: powerlines down walnut grove cherry lane road

Create Bag-of-Words Model

Create a bag-of-words model.

cleanedBag = bagOfWords(cleanedDocuments)
cleanedBag = 
  bagOfWords with properties:

          Counts: [36176×18469 double]
      Vocabulary: [1×18469 string]
        NumWords: 18469
    NumDocuments: 36176

Remove words that appear two times or fewer in the bag-of-words model.

cleanedBag = removeInfrequentWords(cleanedBag,2)
cleanedBag = 
  bagOfWords with properties:

          Counts: [36176×6974 double]
      Vocabulary: [1×6974 string]
        NumWords: 6974
    NumDocuments: 36176

Some preprocessing steps, such as removeInfrequentWords, can leave empty documents in the bag-of-words model. To ensure that no empty documents remain in the bag-of-words model after preprocessing, use removeEmptyDocuments as the last step.

Remove empty documents from the bag-of-words model and the corresponding labels from labels.

[cleanedBag,idx] = removeEmptyDocuments(cleanedBag);
labels(idx) = [];
cleanedBag
cleanedBag = 
  bagOfWords with properties:

          Counts: [28137×6974 double]
      Vocabulary: [1×6974 string]
        NumWords: 6974
    NumDocuments: 28137

Create a Preprocessing Function

It can be useful to create a function that performs all the preprocessing steps, so that you can prepare different collections of text data in the same way. For example, you can use the function to preprocess new data with the same steps as the training data.

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessWeatherNarratives performs the following steps:

  1. Tokenize the text using tokenizedDocument.

  2. Lemmatize the words using normalizeWords.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

Use the example preprocessing function preprocessWeatherNarratives to prepare the text data.

newText = "A tree is downed outside Apple Hill Drive, Natick";
newDocuments = preprocessWeatherNarratives(newText)
newDocuments = 
  tokenizedDocument:

   7 tokens: tree down outside apple hill drive natick

Compare with Raw Data

Compare the preprocessed data with the raw data.

rawDocuments = tokenizedDocument(textData);
rawBag = bagOfWords(rawDocuments)
rawBag = 
  bagOfWords with properties:

          Counts: [36176×23302 double]
      Vocabulary: [1×23302 string]
        NumWords: 23302
    NumDocuments: 36176

Calculate the reduction in the size of the vocabulary.

numWordsCleaned = cleanedBag.NumWords;
numWordsRaw = rawBag.NumWords;
reduction = 1 - numWordsCleaned/numWordsRaw
reduction = 0.7007

Compare the raw data and the cleaned data by visualizing the two bag-of-words models using word clouds.

figure
subplot(1,2,1)
wordcloud(rawBag);
title("Raw Data")
subplot(1,2,2)
wordcloud(cleanedBag);
title("Cleaned Data")

Preprocessing Function

The function preprocessWeatherNarratives performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Lemmatize the words using normalizeWords.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

function documents = preprocessWeatherNarratives(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words. To improve lemmatization, first use
% addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
