Remove stop words from a cell array

I am trying to remove German stop words from a cell array. Below is the relevant code:
if removeStopWords == 1
stop_words = cellstr(stopWords('Language','de'));
split1 = regexp(chatM, '\s','Split');
split1 = cellfun(@(x)strjoin(x), split1, 'Uni', 0);
split1 = cellfun(@(x)convertStringsToChars(x), split1, 'Uni', 0);
chatM = strjoin(split1(~ismember(split1, stop_words)), ' ');
end
I applied a solution from a similar question on the forum, but it did not work properly for me and I do not understand why.
The problem is that it removes patterns rather than whole words: imagine you want to remove "Ok" from "Ok Okay OOk". The result will be "ay O", but I want to get "Okay OOk".
By the way, I also tried removeStopWords(str), but the result was the same. In the end I want to count the occurrences of "relevant" words, so I need to remove the stop words first.
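For reference, the whole-word behavior described above can be sketched in base MATLAB, without any toolbox. This is only an illustration on a single string scalar: `stopList` is a hypothetical one-word stop list, not the output of stopWords, and the comparison is case-sensitive.

```matlab
% Minimal sketch: remove whole words only, never substrings.
str = "Ok Okay OOk";
stopList = "Ok";                              % hypothetical stop list for illustration

tokens = split(str);                          % split on whitespace into a 3x1 string array
keep = tokens(~ismember(tokens, stopList));   % exact token match, so "Okay" and "OOk" survive
result = join(keep, " ");                     % rebuild a single string

disp(result)                                  % prints: Okay OOk
```

For a multi-row string array such as chatM, the same steps would have to be applied per row (for example in a loop), since each row splits into a different number of tokens.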

Accepted Answer

Cris LaPierre on 24 Jul 2020

Do you have the Text Analytics Toolbox? You must, because you are using stopWords. Try using the removeStopWords function.

6 comments

Sergiu Panainte on 24 Jul 2020
Yeah, I have. And as I mentioned in the question, I already tried it.
Cris LaPierre on 24 Jul 2020
Edited: Cris LaPierre on 24 Jul 2020
Can you share an example of chatM?
If you haven't yet seen it, this example might be helpful.
Sergiu Panainte on 24 Jul 2020
Below is the output of the first five rows of chatM, just before entering the if block.
K>> display(chatM(1:5))
5×1 string array
"hi haben wir gut oder schlecht"
"leider schlecht"
"sicher"
"ja"
"wollen wir das so machen dass jeder gleich viel bekommt "
K>>
Cris LaPierre on 24 Jul 2020
Edited: Cris LaPierre on 24 Jul 2020
Ok, so if I were going to do it, I might do something like this:
doc = tokenizedDocument(lower(chatM));   % tokenize the lowercased text
doc = removeStopWords(doc);              % remove stop words as whole tokens
chatM = joinWords(doc)                   % rebuild one string per document
The corresponding output is
chatM = 4×1 string
"hi gut schlecht"
"leider schlecht"
"sicher"
"wollen machen gleich viel bekommt"
If you want to count word frequencies, it's better to keep the text as separate words (doc). There aren't many words yet, but you can obtain a count of the most frequent words this way.
bag = bagOfWords(doc);
topkwords(bag)
ans =
       Word       Count
    __________    _____
    "schlecht"      2
    "hi"            1
    "gut"           1
    "leid"          1
    "sich"          1
Sergiu Panainte on 25 Jul 2020
Thanks, in general it solved my problem. But I recently found out that MATLAB supports parallel computing.
Is there a way to apply it to your solution? That would be really great, because my original table has about 90k rows.
Cris LaPierre on 25 Jul 2020
I'm not familiar with the requirements of parallel computing. Generally, if the process can be split into pieces without affecting the results (the results of each piece are independent of the results of the other pieces), then it should technically be possible to parallelize it. If you look at the product requirements for the Text Analytics Toolbox, you can see that the Parallel Computing Toolbox is recommended, suggesting that it is supported.
Do you have access to the Parallel Computing Toolbox? The simplest approach would be to test. Note that there is an initial load time as the parallel workers start up (~30 seconds?). Test with and without parallelization. You may find it quicker to process the text without it.
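One way such a test could look is sketched below. It is not a verified recipe. It assumes the Text Analytics Toolbox is installed, that the rows are independent, and that tokenizedDocument and removeStopWords behave the same on parallel workers as in a normal session. `numChunks` is a hypothetical tuning value, the sample rows are taken from earlier in this thread, and the language is set explicitly to 'de' so that small chunks do not depend on automatic language detection.

```matlab
% Sketch: compare serial and chunked parfor stop-word removal.
chatM = ["hi haben wir gut oder schlecht"; "leider schlecht"; "sicher"; ...
         "ja"; "wollen wir das so machen dass jeder gleich viel bekommt "];

% Serial version
tic
docSerial = removeStopWords(tokenizedDocument(lower(chatM), 'Language', 'de'));
cleanSerial = joinWords(docSerial);
tSerial = toc;

% Split the rows into contiguous chunks (numChunks is an assumed value)
numChunks = 2;
edges = round(linspace(0, numel(chatM), numChunks + 1));
chunks = cell(numChunks, 1);
for k = 1:numChunks
    chunks{k} = chatM(edges(k)+1:edges(k+1));
end

% Parallel version: each chunk is processed independently
cleanChunks = cell(numChunks, 1);
tic
parfor k = 1:numChunks
    d = removeStopWords(tokenizedDocument(lower(chunks{k}), 'Language', 'de'));
    cleanChunks{k} = joinWords(d);       % return plain strings from the workers
end
cleanParallel = vertcat(cleanChunks{:});
tParallel = toc;

fprintf('serial: %.3f s, parfor: %.3f s, same result: %d\n', ...
    tSerial, tParallel, isequal(cleanSerial, cleanParallel));
```

Without the Parallel Computing Toolbox, parfor simply runs as an ordinary loop, so the script still runs but shows no speedup. The first parfor call also includes the pool start-up time, so it is worth timing a second run before drawing conclusions.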


Asked: 24 Jul 2020

Commented: 25 Jul 2020
