Word processing: How can I get token numbers from a document?

Question

E on 17 Dec 2021

Direct link to this question

https://la.mathworks.com/matlabcentral/answers/1613345-word-processing-how-can-i-get-token-numbers-from-a-document

Answered: Rishabh Singh on 5 Jan 2022

I'm trying to tokenize a huge document (Wikipedia) so that I can convert it to word vectors. I want to convert the giant char array into a numeric array of token IDs (indices into a dictionary I already have), in word order. I was able to write code for this using for loops over regexp() calls, but it's taking days and days to run. tokenizedDocument() looks like a good alternative, except that I can't figure out how to get the document back as a list of numeric token IDs.

Has anyone successfully tokenized a document in this way? If so, how?

Thanks!

1 comment

E on 17 Dec 2021

For example, 'a cat ran ...' should be converted to [1,49,34,...] (where "cat" is the 49th word in the dictionary, etc.).

Answer 1

Rishabh Singh on 5 Jan 2022

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/1613345-word-processing-how-can-i-get-token-numbers-from-a-document#answer_868400

Hi,

You can use "tokenizedDocument" to tokenize your document. Most of the run time will then go into assigning an ID number to each token, so I would suggest using a map container (such as containers.Map) for that lookup.
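
A minimal sketch of that suggestion, not a tested solution for a Wikipedia-sized corpus. The names vocab, word2id, txt, tokens and ids are made up for illustration, and the tiny dictionary stands in for the real one. The tokenizing line uses base-MATLAB split so the snippet runs without Text Analytics Toolbox; with the toolbox, string(tokenizedDocument(txt)) should give the tokens of a single document instead.

```matlab
% Hypothetical dictionary: a word's position in vocab is its token ID.
vocab = ["a" "ran" "cat" "the"];

% Build the map once, so each lookup is a hash lookup rather than a search.
word2id = containers.Map(cellstr(vocab), 1:numel(vocab));

txt = "a cat ran";

% Whitespace tokenization as a stand-in for tokenizedDocument.
% Assumed toolbox equivalent: tokens = string(tokenizedDocument(txt));
tokens = split(txt).';

% Look up all tokens in one call; words missing from the map keep ID 0.
keys = cellstr(tokens);
known = isKey(word2id, keys);
ids = zeros(1, numel(keys));
ids(known) = cell2mat(values(word2id, keys(known)));

% Cross-check with ismember, which also returns 0 for unknown words.
[~, idsCheck] = ismember(tokens, vocab);

disp(ids)   % 1 3 2
```

Both lookups are vectorized, so there is no per-word loop or regexp() call. For a very large document it may be worth processing the text in chunks and concatenating the resulting ID arrays.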
