bagOfWords

Bag-of-words model

Description

A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection.

bagOfWords does not split text into words. To create an array of tokenized documents, see tokenizedDocument.
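
For example, a minimal sketch of tokenizing raw text before creating a model (the sample strings are illustrative):

str = ["an example of a short sentence" "a second short sentence"];
documents = tokenizedDocument(str);
bag = bagOfWords(documents);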

Creation

Description

bag = bagOfWords creates an empty bag-of-words model.

bag = bagOfWords(documents) counts the words appearing in documents and returns a bag-of-words model.

bag = bagOfWords(uniqueWords,counts) creates a bag-of-words model using the words in uniqueWords and the corresponding frequency counts in counts.

Input Arguments

documents
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
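
For example, a row vector of words is treated as a single document; a minimal sketch with illustrative words:

words = ["an" "example" "of" "a" "short" "sentence"];
bag = bagOfWords(words);   % a model with NumDocuments equal to 1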

uniqueWords
Unique word list, specified as a string vector or a cell array of character vectors. If uniqueWords contains <missing>, then the function ignores the missing values. The size of uniqueWords must be 1-by-V, where V is the number of columns of counts.

Example: ["an" "example" "list"]

Data Types: string | cell
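
A hedged sketch of the <missing> behavior (illustrative words; the expectation is that the missing entry, together with its column of counts, is dropped from the model):

uniqueWords = ["an" "example" missing];
counts = [2 1 3];
bag = bagOfWords(uniqueWords,counts);
bag.Vocabulary   % expected: ["an" "example"]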

counts
Frequency counts of words corresponding to uniqueWords, specified as a matrix of nonnegative integers. The value counts(i,j) corresponds to the number of times the word uniqueWords(j) appears in the ith document.

counts must have numel(uniqueWords) columns.
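
As a small illustration of the counts(i,j) convention, consider two documents over three illustrative words:

uniqueWords = ["cat" "dog" "fish"];
counts = [2 0 1;   % document 1: "cat" twice, "fish" once
          0 3 0];  % document 2: "dog" three times
bag = bagOfWords(uniqueWords,counts);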

Properties

Counts
Word counts per document, specified as a sparse matrix.

NumDocuments
Number of documents seen, specified as a nonnegative integer.

NumWords
Number of unique words in the model, specified as a nonnegative integer.

Vocabulary
Unique words in the model, specified as a string vector.

Data Types: string
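
A brief sketch of reading these properties from a model (the sample text is illustrative):

documents = tokenizedDocument(["a short document" "another short document"]);
bag = bagOfWords(documents);
bag.NumDocuments   % number of documents seen
bag.NumWords       % number of unique words
full(bag.Counts)   % Counts is stored as a sparse matrix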

Object Functions

encode - Encode documents as matrix of word or n-gram counts (see the sketch after this list)
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
topkwords - Most important words in bag-of-words model or LDA topic
addDocument - Add documents to bag-of-words or bag-of-n-grams model
removeDocument - Remove documents from bag-of-words or bag-of-n-grams model
removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
removeWords - Remove selected words from documents or bag-of-words model
removeInfrequentWords - Remove words with low counts from bag-of-words model
join - Combine multiple bag-of-words or bag-of-n-grams models
wordcloud - Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
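
For example, encode maps documents onto the vocabulary of an existing model; a minimal sketch with illustrative text:

documents = tokenizedDocument(["a short example" "another short example"]);
bag = bagOfWords(documents);
M = encode(bag,documents);   % sparse matrix of counts, one row per document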

Examples

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154

View the top 10 words and their total counts.

tbl = topkwords(bag,10)
tbl=10×2 table
     Word      Count
    _______    _____

    "thy"       281 
    "thou"      234 
    "love"      162 
    "thee"      161 
    "doth"       88 
    "mine"       63 
    "shall"      59 
    "eyes"       56 
    "sweet"      55 
    "time"       53 

Create a bag-of-words model using a string array of unique words and a matrix of word counts.

uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [ ...
    1 2 0 1 0 1 0;
    0 0 3 1 0 4 0;
    1 0 0 5 0 3 1;
    1 0 0 1 7 0 0];
bag = bagOfWords(uniqueWords,counts)
bag = 
  bagOfWords with properties:

          Counts: [4x7 double]
      Vocabulary: ["a"    "an"    "another"    "example"    "final"    "sentence"    "third"]
        NumWords: 7
    NumDocuments: 4

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Remove the stop words from a bag-of-words model by inputting a list of stop words to removeWords. Stop words, such as "a", "the", and "in", are commonly removed from text before analysis.

documents = tokenizedDocument([
    "an example of a short sentence" 
    "a second short sentence"]);
bag = bagOfWords(documents);
newBag = removeWords(bag,stopWords)
newBag = 
  bagOfWords with properties:

          Counts: [2x4 double]
      Vocabulary: ["example"    "short"    "sentence"    "second"]
        NumWords: 4
    NumDocuments: 2
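
To also drop rare words, you could follow this with removeInfrequentWords; a hedged sketch continuing the example above:

% Remove words that appear at most once in total; here "example" and
% "second" would be dropped, keeping "short" and "sentence".
smallerBag = removeInfrequentWords(newBag,1);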

Create a table of the most frequent words of a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents) 
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154

Find the top five words. By default, topkwords returns the top five words of the model.

T = topkwords(bag);

Find the top 20 words in the model.

k = 20;
T = topkwords(bag,k)
T=20×2 table
      Word      Count
    ________    _____

    "thy"        281 
    "thou"       234 
    "love"       162 
    "thee"       161 
    "doth"        88 
    "mine"        63 
    "shall"       59 
    "eyes"        56 
    "sweet"       55 
    "time"        53 
    "beauty"      52 
    "nor"         52 
    "art"         51 
    "yet"         51 
    "o"           50 
    "heart"       50 
      ⋮

Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    ...    ] (1x3092 string)
        NumWords: 3092
    NumDocuments: 154

Create a tf-idf matrix. View the first 10 rows and columns.

M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10

    3.6507    4.3438    2.7344    3.6507    4.3438    2.2644    3.2452    3.8918    2.4720    2.5520
         0         0         0         0         0    4.5287         0         0         0         0
         0         0         0         0         0         0         0         0         0    2.5520
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0         0         0         0         0         0
         0         0         0         0         0    2.2644         0         0         0    2.5520
         0         0    2.7344         0         0         0         0         0         0         0

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: ["fairest"    "creatures"    "desire"    "increase"    "thereby"    "beautys"    "rose"    "might"    "never"    "die"    "riper"    "time"    "decease"    "tender"    "heir"    "bear"    "memory"    "thou"    "contracted"    …    ]
        NumWords: 3092
    NumDocuments: 154

Visualize the bag-of-words model using a word cloud.

figure
wordcloud(bag);

Figure contains an object of type wordcloud.

If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel; otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.

Create a list of filenames. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet.

filenames = [
    "exampleSonnet1.txt"
    "exampleSonnet2.txt"
    "exampleSonnet3.txt"
    "exampleSonnet4.txt"];

Create a bag-of-words model from a collection of files. Initialize an empty bag-of-words model and then loop over the files and create a bag-of-words model for each file.

bag = bagOfWords;

numFiles = numel(filenames);
parfor i = 1:numFiles
    filename = filenames(i);
    
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to parallel pool with 4 workers.

Combine the bag-of-words models using join.

bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Tips

  • If you intend to use a held-out test set for your work, then partition your text data before using bagOfWords; otherwise, the bag-of-words model may bias your analysis. One way to partition is sketched below.
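
A minimal sketch of one way to partition, assuming Statistics and Machine Learning Toolbox is available for cvpartition:

% Hold out 10% of the documents for testing before fitting the model.
cvp = cvpartition(numel(documents),'HoldOut',0.1);
documentsTrain = documents(training(cvp));
documentsTest = documents(test(cvp));
bag = bagOfWords(documentsTrain);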

Version History

Introduced in R2017b