Text Analytics Toolbox: removing entity-tagged tokens from documents

I have a very large set of documents that I am preprocessing for use in a BERT classification model.
I have tokenized the documents and added the entity details.
Now I want to remove all of the tokens within the documents that have been tagged as "organization".
I have the following variables:
documents: tokenized documents
tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.
Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'
How do I remove all of the tokens in the variable documents whose Entity is "organization"?
For example, in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and the rows of tdetails where Entity is "organization", but that would take quite a while.
I can't seem to figure out how to do this more simply.

Accepted Answer

Cris LaPierre on 18 Nov 2023
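I would use removeWords, passing it the tokens whose Entity is "organization".
documents = tokenizedDocument(Text(:));
tdetails = tokenDetails(documents);
% Select the tokens tagged as organizations, then remove them from the documents.
documents2 = removeWords(documents,tdetails.Token(tdetails.Entity=="organization"));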
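The removeWords approach can be sketched end to end on a small example. This is a minimal sketch, not a definitive implementation: the sample strings and the names str, isOrg and orgTokens are hypothetical, and it assumes that Text Analytics Toolbox is installed and that addEntityDetails can run on your system (it may need an additional support package). Which tokens get tagged as organizations depends on the entity model, so no output is shown.

```matlab
% Hypothetical sample text; replace with your own data.
str = [
    "Astoria Federal Savings Bank settled the case."
    "The bank agreed to pay a fine."];

% Tokenize and tag entities (person, location, organization, other, non-entity).
documents = tokenizedDocument(str);
documents = addEntityDetails(documents);

% One row per token, including the Token and Entity variables.
tdetails = tokenDetails(documents);

% Logical mask over the table rows, then the distinct tokens it selects.
isOrg = tdetails.Entity == "organization";
orgTokens = unique(tdetails.Token(isOrg));

% removeWords matches by token text, so every occurrence of these tokens
% is removed from every document, not only the tagged occurrences.
documents2 = removeWords(documents, orgTokens);
```

Because the removal is by token text rather than by position, a word tagged as an organization in one document is also dropped from documents where it was tagged differently. If that matters for your data, the removal would need to be done per document instead.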
1 Comment
david cowan on 19 Nov 2023
Moved: Cris LaPierre on 19 Nov 2023
Really appreciate that.
removeWords !!
I'll not forget that now - I knew there had to be a simple approach I was just missing

Release

R2023b
