topkngrams

Most frequent n-grams

Syntax

tbl = topkngrams(bag)
tbl = topkngrams(bag,k)
tbl = topkngrams(___,Name,Value)

Description

example

tbl = topkngrams(bag) returns a table listing the five most frequently seen n-grams in the bag-of-n-grams model bag.

example

tbl = topkngrams(bag,k) lists the k most frequently seen n-grams in the bag-of-n-grams model bag.

example

tbl = topkngrams(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

collapse all

Create a table of the most frequent bigrams of a bag-of-n-grams model.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model.

bag = bagOfNgrams(documents)
bag = 
  bagOfNgrams with properties:

          Counts: [154×8799 double]
      Vocabulary: [1×3092 string]
          Ngrams: [8799×2 string]
    NgramLengths: 2
       NumNgrams: 8799
    NumDocuments: 154

Find the top 5 bigrams.

tbl = topkngrams(bag)
tbl=5×3 table
         Ngram          Count    NgramLength
    ________________    _____    ___________

    "thou"    "art"      34           2     
    "mine"    "eye"      15           2     
    "thy"     "self"     14           2     
    "thou"    "dost"     13           2     
    "mine"    "own"      13           2     

Find the top 10 bigrams.

tbl = topkngrams(bag,10)
tbl=10×3 table
          Ngram          Count    NgramLength
    _________________    _____    ___________

    "thou"    "art"       34           2     
    "mine"    "eye"       15           2     
    "thy"     "self"      14           2     
    "thou"    "dost"      13           2     
    "mine"    "own"       13           2     
    "thy"     "sweet"     12           2     
    "thy"     "love"      11           2     
    "dost"    "thou"      10           2     
    "thou"    "wilt"      10           2     
    "love"    "thee"       9           2     

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-n-grams model. To count n-grams of length 2 and 3 (bigrams and trigrams), specify 'NgramLengths' to be the vector [2 3].

bag = bagOfNgrams(documents,'NgramLengths',[2 3])
bag = 
  bagOfNgrams with properties:

          Counts: [154×18022 double]
      Vocabulary: [1×3092 string]
          Ngrams: [18022×3 string]
    NgramLengths: [2 3]
       NumNgrams: 18022
    NumDocuments: 154

View the 10 most common n-grams of length 2 (bigrams).

topkngrams(bag,10,'NGramLengths',2)
ans=10×3 table
             Ngram             Count    NgramLength
    _______________________    _____    ___________

    "thou"    "art"      ""     34           2     
    "mine"    "eye"      ""     15           2     
    "thy"     "self"     ""     14           2     
    "thou"    "dost"     ""     13           2     
    "mine"    "own"      ""     13           2     
    "thy"     "sweet"    ""     12           2     
    "thy"     "love"     ""     11           2     
    "dost"    "thou"     ""     10           2     
    "thou"    "wilt"     ""     10           2     
    "love"    "thee"     ""      9           2     

View the 10 most common n-grams of length 3 (trigrams).

 topkngrams(bag,10,'NGramLengths',3)
ans=10×3 table
               Ngram                Count    NgramLength
    ____________________________    _____    ___________

    "thy"     "sweet"    "self"       4           3     
    "why"     "dost"     "thou"       4           3     
    "thy"     "self"     "thy"        3           3     
    "thou"    "thy"      "self"       3           3     
    "mine"    "eye"      "heart"      3           3     
    "thou"    "shalt"    "find"       3           3     
    "fair"    "kind"     "true"       3           3     
    "thou"    "art"      "fair"       2           3     
    "love"    "thy"      "self"       2           3     
    "thy"     "self"     "thou"       2           3     

Input Arguments

collapse all

Input bag-of-n-grams model, specified as a bagOfNgrams object.

Number of n-grams to return, specified as a positive integer.

Example: 20

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'NgramLengths',[2 3] specifies to return the top bigrams and trigrams.

N-gram lengths, specified as the comma separated pair consisting of 'NgramLengths' and a positive integer or a vector of positive integers.

If you specify NgramLengths, then the function returns n-grams of these lengths only. If you do not specify NgramLengths, then the function returns the top n-grams regardless of length.

Example: [1 2 3]

Indicator for forcing output to be returned as cell array, specified as the comma separated pair consisting of 'ForceCellOutput' and true or false.

Data Types: logical

Output Arguments

collapse all

Table of top n-grams sorted in order of frequency or a cell array of tables.

The table has the following columns:

NgramN-gram specified as a string vector
CountNumber of times the n-gram appears in the bag-of-n-grams model.
NgramLengthLength of the n-gram.

If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of tables. Each element in the cell array is a table containing the top n-grams of the corresponding element of bag.

Introduced in R2018a