Text Analytics Toolbox: removing entity details from documents
david cowan
on 18 Nov 2023
Moved: Cris LaPierre
on 19 Nov 2023
I have a very large set of documents that I am preprocessing for use in a BERT classification model.
I have tokenized the documents and added the entity details.
Now I want to remove all of the tokens within the documents that have been tagged as "organization".
I have the following variables:
documents: tokenized documents
tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.
Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'
How do I remove all of the tokens in the variable documents whose Entity is organization?
E.g. in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and the rows of tdetails where Entity is organization, but that would take quite a while.
I can't seem to figure out how to do this more simply.
Accepted Answer
Cris LaPierre
on 18 Nov 2023
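documents = tokenizedDocument(Text(:));
tdetails = tokenDetails(documents);
% Collect the tokens tagged as organizations, then remove them from the documents
orgTokens = tdetails.Token(tdetails.Entity == "organization");
documents2 = removeWords(documents, orgTokens);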
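For anyone who wants to try the idea on a small input first, here is a minimal self-contained sketch. The two sample sentences and the variable names str, isOrg and orgTokens are made up for illustration and are not from the original post. The example table in the question spells the tag 'organization', so the sketch compares against that spelling. Which tokens actually get tagged depends on the toolbox's entity recognition model, so no particular output is assumed.

```matlab
% Hypothetical sample text (not from the original question)
str = [
    "Astoria Federal Savings Bank settled the case."
    "Mary works at Acme Corporation in Boston."];

% Tokenize and add entity details, as described in the question
documents = tokenizedDocument(str);
documents = addEntityDetails(documents);

% One row per token, including the Entity column
tdetails = tokenDetails(documents);

% Logical mask of the rows tagged as organizations
isOrg = tdetails.Entity == "organization";

% Unique token strings to remove (may be empty if nothing was tagged)
orgTokens = unique(tdetails.Token(isOrg));

% Remove those tokens from every document in one call, with no explicit loop
documents2 = removeWords(documents, orgTokens);
```

One thing to keep in mind: removeWords matches on the token text, so every occurrence of a listed token is removed from all documents, including occurrences that were not themselves tagged as an organization. If that matters for your data, it is worth checking tdetails for tokens that appear under more than one entity tag.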