Word processing: How can I get token numbers from a document?

E on 17 Dec 2021
Answered: Rishabh Singh on 5 Jan 2022
I'm trying to tokenize a huge document (Wikipedia) so that I can convert it to word vectors. I want to convert the giant char array into a numeric array of token IDs (indices into a dictionary I already have), in word order. I was able to write code for this using for loops over regexp() calls, but it takes days to run. tokenizedDocument() looks like a good alternative, except that I can't figure out how to get the document back as a list of numeric token IDs.
Has anyone successfully tokenized a document in this way? If so, how?
Thanks!
1 comment
E on 17 Dec 2021
For example, 'a cat ran ...' should be converted to [1,49,34,...] (where cat is the 49th word in the dictionary, etc).


Answers (1)

Rishabh Singh on 5 Jan 2022
Hi,
You can use "tokenizedDocument" to tokenize your document. The real performance cost comes from assigning an ID number to each token. I would suggest using a map container (such as containers.Map) for that lookup.
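
As a rough illustration of that suggestion, here is a minimal sketch, not a tested solution for Wikipedia-sized input. The variables `dictionary` and `text` are hypothetical stand-ins for your own data, and the choice of 0 as the ID for unknown words is an assumption. It tokenizes once with tokenizedDocument (Text Analytics Toolbox), builds the map once, and then looks up all tokens in a single call instead of looping.

```matlab
% Hypothetical stand-ins for your data: word k in "dictionary" has token ID k.
dictionary = ["a" "cat" "ran" "the"];
text = "a cat ran";

% Tokenize once and pull the tokens out in word order.
doc = tokenizedDocument(text);
words = string(doc);                 % string vector of tokens (scalar document)

% Build the word -> ID lookup once, outside any loop.
wordToId = containers.Map(cellstr(dictionary), 1:numel(dictionary));

% Look up every token in one call. Words missing from the dictionary
% keep ID 0 (an arbitrary choice made for this sketch).
keys = cellstr(words);
known = isKey(wordToId, keys);
ids = zeros(1, numel(words));
ids(known) = cell2mat(values(wordToId, keys(known)));

% Vectorized alternative without a map: ismember returns, for each token,
% its position in the dictionary, or 0 if the token is not found.
[~, idsAlt] = ismember(words, dictionary);
```

For a very large corpus it may be worth running this on chunks of text (for example, one article at a time) and concatenating the resulting ID vectors, since holding every token of the whole document in memory at once could be expensive.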

Release

R2021a
