wordEmbedding

Word embedding model to map words to vectors and back

Description

A word embedding, popularized by the word2vec, GloVe, and fastText libraries, maps words in a vocabulary to real vectors.

The vectors attempt to capture the semantics of the words, so that similar words have similar vectors. Some embeddings also capture relationships between words, such as "king is to queen as man is to woman". In vector form, this relationship is king – man + woman = queen.

Creation

Create a word embedding by loading a pretrained embedding using fastTextWordEmbedding, by reading an embedding from a file using readWordEmbedding, or by training an embedding using trainWordEmbedding.
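
For quick reference, a minimal sketch of the three creation routes (here, documents is assumed to be a tokenizedDocument array, and the .vec file is a word embedding file such as the one used in the examples below):

emb = fastTextWordEmbedding;                          % load pretrained embedding (requires support package)
emb = readWordEmbedding("exampleWordEmbedding.vec");  % read embedding from file
emb = trainWordEmbedding(documents);                  % train embedding on tokenized documents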

Properties


Dimension
Dimension of the word embedding, specified as a positive integer.

Example: 300

Vocabulary
Unique words in the model, specified as a string vector.

Data Types: string
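
You can query these properties with dot notation. As a quick sketch, assuming the fastText support package is installed:

emb = fastTextWordEmbedding;
emb.Dimension         % embedding dimension, for example 300
emb.Vocabulary(1:5)   % first five words in the vocabulary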

Object Functions

vec2word            Map embedding vector to word
word2vec            Map word to embedding vector
isVocabularyWord    Test if word is member of word embedding or encoding
writeWordEmbedding  Write word embedding file
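
The examples below cover vec2word, word2vec, and writeWordEmbedding; for completeness, a minimal sketch of isVocabularyWord (the token "notaword" is a made-up out-of-vocabulary string):

emb = fastTextWordEmbedding;
tf = isVocabularyWord(emb,["king" "notaword"])   % logical vector: true where the word is in the vocabulary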

Examples


Download Pretrained Word Embedding

Download and install the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package.

Type fastTextWordEmbedding at the command line.

fastTextWordEmbedding

If the Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding support package is not installed, then the function provides a link to the required support package in the Add-On Explorer. To install the support package, click the link, and then click Install. Check that the installation is successful by typing emb = fastTextWordEmbedding at the command line.

emb = fastTextWordEmbedding
emb = 

  wordEmbedding with properties:

     Dimension: 300
    Vocabulary: [1×1000000 string]

If the required support package is installed, then the function returns a wordEmbedding object.

Map Words to Vectors and Back

Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding
emb = 
  wordEmbedding with properties:

     Dimension: 300
    Vocabulary: [1×1000000 string]

Map the words "Italy", "Rome", and "Paris" to vectors using word2vec.

italy = word2vec(emb,"Italy");
rome = word2vec(emb,"Rome");
paris = word2vec(emb,"Paris");

Map the vector italy - rome + paris to a word using vec2word.

word = vec2word(emb,italy - rome + paris)
word = 
"France"

Convert Documents to Sequences of Word Vectors

Convert an array of tokenized documents to sequences of word vectors using a pretrained word embedding.

Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;

Load the factory reports data and create a tokenizedDocument array.

filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
documents = tokenizedDocument(textData);

Convert the documents to sequences of word vectors using doc2sequence. The doc2sequence function, by default, left-pads the sequences to have the same length. When converting large collections of documents using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from padding the data, set the 'PaddingDirection' option to 'none'. Alternatively, you can control the amount of padding using the 'Length' option.

sequences = doc2sequence(emb,documents,'PaddingDirection','none');

View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension, and S is the number of word vectors in the sequence.

sequences(1:10)
ans=10×1 cell array
    {300×10 single}
    {300×11 single}
    {300×11 single}
    {300×6  single}
    {300×5  single}
    {300×10 single}
    {300×8  single}
    {300×9  single}
    {300×7  single}
    {300×13 single}
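
As mentioned above, to pad or truncate all sequences to a fixed length instead, set the 'Length' option. A sketch using an illustrative target length of 10 time steps:

sequencesFixed = doc2sequence(emb,documents,'Length',10);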

Read Word Embedding from File

Read the example word embedding. This model was derived by analyzing text from Wikipedia.

filename = "exampleWordEmbedding.vec";
emb = readWordEmbedding(filename)
emb = 
  wordEmbedding with properties:

     Dimension: 50
    Vocabulary: ["utc"    "first"    "new"    "two"    "time"    "up"    "school"    "article"    "world"    "years"    "university"    "talk"    "many"    "national"    "later"    "state"    "made"    "born"    "city"    "de"    ...    ] (1x9999 string)

Explore the word embedding using word2vec and vec2word.

king = word2vec(emb,"king");
man = word2vec(emb,"man");
woman = word2vec(emb,"woman");
word = vec2word(emb,king - man + woman)
word = 
"queen"

Train and Write Word Embedding to File

Train a word embedding and write it to a text file.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Train a word embedding using trainWordEmbedding.

emb = trainWordEmbedding(documents)
Training: 100% Loss: 2.70637  Remaining time: 0 hours 0 minutes.
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: ["thy"    "thou"    "love"    "thee"    "doth"    "mine"    "shall"    "eyes"    "sweet"    "time"    "nor"    "beauty"    "yet"    "art"    "heart"    "o"    "thine"    "hath"    "fair"    "make"    "still"    ...    ] (1x401 string)

Write the word embedding to a text file.

filename = "exampleSonnetsEmbedding.vec";
writeWordEmbedding(emb,filename)

Read the word embedding file using readWordEmbedding.

emb = readWordEmbedding(filename)
emb = 
  wordEmbedding with properties:

     Dimension: 100
    Vocabulary: ["thy"    "thou"    "love"    "thee"    "doth"    "mine"    "shall"    "eyes"    "sweet"    "time"    "nor"    "beauty"    "yet"    "art"    "heart"    "o"    "thine"    "hath"    "fair"    "make"    "still"    ...    ] (1x401 string)

Version History

Introduced in R2017b