Find words common across multiple string cells
Tejas
on 26 Oct 2020
Commented: Tejas
on 27 Oct 2020
I have a cell array where each cell holds a string array of a different length, and each string array is essentially a column of single words. Something like this:
words{1,1} = ["sphere";"geometry";"number";"algebra";"function"];
words{1,2} = ["geometry";"equation";"nonlinear";"partial";"function"];
words{1,3} = ["number";"derivative";"function";"topology";"equation";"theory"];
words{1,4} = ["equation";"integral";"geometry";"function";"singular"];
I want to find the words that appear in at least a specified number of cells. That is, if I ask for words common to at least 4 cells, then I should get back
common_words = "function";
If I want words common in at least 3 cells, I should get back
common_words = ["geometry";"function";"equation"];
I can use intersect in a loop (however inefficient that might be) if the words are required to be common to all the cells. But how do I find the words shared by at least a given number of cells? As I understand it, that would require intersecting every combination of cells, and the computation time would grow very quickly as the number of cells increases. Is there an efficient way to do this, or do I have to enumerate the combinations?
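For reference, the all-cells case mentioned above might look like the following minimal sketch. The variable name `common` is only illustrative, and the double-quoted string syntax assumes a release that supports string arrays (R2017a or later).

```matlab
% Sample data from the question.
words = {["sphere";"geometry";"number";"algebra";"function"], ...
         ["geometry";"equation";"nonlinear";"partial";"function"], ...
         ["number";"derivative";"function";"topology";"equation";"theory"], ...
         ["equation";"integral";"geometry";"function";"singular"]};

% Start from the first cell and keep only the words that also occur
% in every later cell.
common = words{1};
for k = 2:numel(words)
    common = intersect(common, words{k});
end
```

This only handles the "common to every cell" case; it does not extend directly to "at least n cells" without looping over combinations.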
4 comments
Stephen23
on 26 Oct 2020
Is the cell array or are the strings particularly large? Would there be any memory issues if they were concatenated or merged together?
Accepted Answer
Stephen23
on 27 Oct 2020
Edited: Stephen23
on 27 Oct 2020
My ancient version does not support strings, so I used cell arrays of character vectors, but I would expect this to work for string arrays as well. Approach: get the unique words in each cell, concatenate them, then count how often each word occurs using a histogram function:
words{1,1} = {'sphere';'geometry';'number';'algebra';'function'};
words{1,2} = {'geometry';'equation';'nonlinear';'partial';'function'};
words{1,3} = {'number';'derivative';'function';'topology';'equation';'theory'};
words{1,4} = {'equation';'integral';'geometry';'function';'singular'};
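tmp = cellfun(@unique,words,'UniformOutput',false); % unique words per cell, so a word counts once per cell
tmp = vertcat(tmp{:}); % stack all cells into one column
[uni,~,idx] = unique(tmp); % idx maps each entry to its row in uni
cnt = histc(idx,1:max(idx)); % number of cells containing each unique word
out = uni(cnt>=3) % words present in at least 3 cells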
Or as a function:
>> fun = @(n) uni(cnt>=n);
>> fun(4)
ans =
'function'
>> fun(3)
ans =
'equation'
'function'
'geometry'
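For the string arrays in the original question, the same idea can be sketched as below. This is an adaptation under stated assumptions, not the code from the answer: it assumes a release with string array support (R2017a or later for the double-quoted syntax), and it swaps in `accumarray` for `histc`, which the MathWorks documentation lists as not recommended. The variable name `common_words` follows the question.

```matlab
% Sample data from the question, as string arrays.
words = {["sphere";"geometry";"number";"algebra";"function"], ...
         ["geometry";"equation";"nonlinear";"partial";"function"], ...
         ["number";"derivative";"function";"topology";"equation";"theory"], ...
         ["equation";"integral";"geometry";"function";"singular"]};

% Deduplicate within each cell so a word counts at most once per cell.
tmp = cellfun(@unique, words, 'UniformOutput', false);
tmp = vertcat(tmp{:});            % one tall string column
[uni, ~, idx] = unique(tmp);      % idx maps each entry to its row in uni
cnt = accumarray(idx, 1);         % number of cells containing each word

n = 3;                            % required minimum number of cells
common_words = uni(cnt >= n);     % words present in at least n cells
```

The result comes back in the sorted order produced by `unique`, so for n = 3 the words are ordered "equation", "function", "geometry" rather than in the order written in the question. The work is one pass of sorting and counting over all words, with no combinations of cells involved.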