Main Content

addDocument

Add documents to bag-of-words or bag-of-n-grams model

Description

example

newBag = addDocument(bag,documents) adds documents to the bag-of-words or bag-of-n-grams model bag.

Examples

collapse all

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"]
        NumWords: 7
    NumDocuments: 2

Create another array of tokenized documents and add it to the same bag-of-words model.

documents = tokenizedDocument([ 
    "a third example of a short sentence" 
    "another short sentence"]);
newBag = addDocument(bag,documents)
newBag = 
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: ["an"    "example"    "of"    "a"    "short"    "sentence"    "second"    "third"    "another"]
        NumWords: 9
    NumDocuments: 4

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The examples sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

Input Arguments

collapse all

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.

Output Arguments

collapse all

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.

Version History

Introduced in R2017b