How to Index Most Common/Popular Str Patterns by Str Length?

signalsandsystemsishard

14 Oct. 2020


Updated 14 Oct. 2020


In brief, I am trying to find the most common substring patterns across a large string array, and to identify the most common substrings of each length. Please see the example below.

Example:

myStr = ["hello" "yellow" "teller" "mellow"];

Using myStr, my desired output is "ellow" for 5 characters, "ello" for 4 characters, and "ell" for 3 characters.

Note: "ello" does NOT appear in every word; I am interested only in frequency. If possible, I would like to output the 1st, 2nd, 3rd, etc. most popular substrings at each length (e.g. 3 characters, 5 characters).
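To make "frequency" concrete, here is a minimal sketch that counts how many words contain each of the three substrings from the example. It assumes "popular" means "present in the largest number of words" and relies on `contains` (available from R2016b); the `candidates` and `freq` names are only illustrative.

```matlab
myStr = ["hello" "yellow" "teller" "mellow"];
candidates = ["ellow" "ello" "ell"];     % substrings taken from the example above

freq = zeros(size(candidates));          % number of words containing each candidate
for j = 1:numel(candidates)
    % contains returns one logical per word; nnz counts the true entries
    freq(j) = nnz(contains(myStr, candidates(j)));
end

disp(table(candidates(:), freq(:), 'VariableNames', {'Substring', 'Words'}))
```

For this input, "ellow" is found in 2 words, "ello" in 3, and "ell" in all 4, so none of them has to be present in every word to be counted.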

On a related question, user Paramjeet Panwar suggested the code below; however, histc returns an error because myStr is a string array rather than numeric: "First Input must be a real non-sparse numeric array."

a = unique(myStr);
n = histc(myStr,a);
[n,idx] = sort(n);
myFreq = a(idx);
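One possible way around the histc error is to let `unique` return a numeric group index and count the groups with `accumarray`. This is only a sketch: it counts whole strings, not substrings, and the input below is a hypothetical array with one repeated entry so that the counts differ.

```matlab
myStr = ["hello" "yellow" "hello" "mellow"];   % hypothetical input with a repeat

[a, ~, idx] = unique(myStr);       % a: distinct strings; idx: position of each element in a
n = accumarray(idx(:), 1);         % occurrences of each distinct string
[n, order] = sort(n, 'descend');   % most frequent first
myFreq = a(order);                 % distinct strings, ordered by frequency
```

Here `myFreq(1)` is "hello" with `n(1)` equal to 2.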

2 Comments

Walter Roberson on 14 Oct. 2020

A shorter substring will always be at least as common as any longer one that contains it; the counts are equal only when every occurrence of the shorter one is part of the longer one.

Should your code keep track of substrings of various lengths and count only the longest applicable substring?

signalsandsystemsishard on 14 Oct. 2020

Thank you for your response.

My immediate request: how can I order the most popular substrings from a list of strings of varying character lengths?

From this list, I suppose I can just segment the output by length. I do understand that shorter substrings will top the frequency list and I am OK with this!
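A rough sketch of that approach: enumerate every substring of the chosen lengths, count each one once per word, then sort and split the ranked table by length. The `lens` range and `topN` cutoff are assumptions made for illustration, and the brute-force loops may be slow on a very large array.

```matlab
myStr = ["hello" "yellow" "teller" "mellow"];
lens = 3:5;      % substring lengths to report (assumed range)
topN = 2;        % entries to keep per length (assumed)

allSubs = strings(0, 1);
for w = myStr
    c = char(w);
    wordSubs = strings(0, 1);
    for k = lens
        for i = 1:(numel(c) - k + 1)            % no iterations if the word is shorter than k
            wordSubs(end+1, 1) = string(c(i:i+k-1));
        end
    end
    allSubs = [allSubs; unique(wordSubs)];      % each substring counted once per word
end

[u, ~, idx] = unique(allSubs);                  % distinct substrings and group index
counts = accumarray(idx, 1);                    % number of words containing each substring
T = table(u, strlength(u), counts, 'VariableNames', {'Substring', 'Length', 'Words'});
T = sortrows(T, {'Length', 'Words'}, {'ascend', 'descend'});

% Segment the ranked table by length and show the topN rows of each segment.
for k = lens
    seg = T(T.Length == k, :);
    disp(seg(1:min(topN, height(seg)), :))
end
```

With the example array, the top entry is "ell" (4 words) for length 3, "ello" (3 words) for length 4, and "ellow" (2 words) for length 5. Counting once per word is a design choice; dropping the inner `unique` call would count every occurrence instead.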

Past questions asked here were solved by finding the substring shared by every element of a string array. I am interested in popularity rather than commonality (i.e. the substring does not need to be present in ALL elements of the array).
