Converting rough strings to exact strings

Question

I have a string array of filenames which are named in a semi-consistent manner, e.g.:

AllFiles
AllFiles = 
    4x1 string array
        "textIdontCareAbout_Phenolic32_Group5_textIdontCareAbout"
        "textIdontCareAbout_P1_textIdontCareAbout"
        "textIdontCareAbout_Epx2_G3_textIdontCareAbout"
        "textIdontCareAbout_Epoxy_105_textIdontCareAbout"

I'm trying to figure out how to extract and convert the inconsistent substrings of interest (the stuff between "textIdontCareAbout") into a consistent format, e.g.:

AllFiles
AllFiles = 
    4x1 string array
        "P32G5"
        "P1"
        "E2G3"
        "E105"

I had been avoiding using regexp, but having caved and decided to work with that, I'm trying to figure out an elegant way to do this conversion. At present, the only thing I can see working is manually checking for each possible phrasing style I find when manually searching through the data I currently have.

Is there a better way to go about this, or even just some suggestions on how to define the regexp so as to need as few searches as possible?

Paul on 8 Sep 2022

There may be a way to do this with string operations. Hard to tell w/o knowing the rule(s) to apply for what to keep and what to discard from a single string. For example, supposing that @the cyclist has the correct rules, one could do

AllFiles = [
        "textIdontCareAbout_Phenolic32_Group5_textIdontCareAbout"
        "textIdontCareAbout_P1_textIdontCareAbout"
        "textIdontCareAbout_Epx2_G3_textIdontCareAbout"
        "textIdontCareAbout_Epoxy_105_textIdontCareAbout"]
AllFiles = 4×1 string array
    "textIdontCareAbout_Phenolic32_Group5_textIdontCareAbout"
    "textIdontCareAbout_P1_textIdontCareAbout"
    "textIdontCareAbout_Epx2_G3_textIdontCareAbout"
    "textIdontCareAbout_Epoxy_105_textIdontCareAbout"
AllFiles = extractAfter(AllFiles,"_")
AllFiles = 4×1 string array
    "Phenolic32_Group5_textIdontCareAbout"
    "P1_textIdontCareAbout"
    "Epx2_G3_textIdontCareAbout"
    "Epoxy_105_textIdontCareAbout"
AllFiles = reverse(extractAfter(reverse(AllFiles),"_"))
AllFiles = 4×1 string array
    "Phenolic32_Group5"
    "P1"
    "Epx2_G3"
    "Epoxy_105"
upperchars = isstrprop(AllFiles,'upper')
upperchars = 4×1 cell array
    {[1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]}
    {[                              1 0]}
    {[                    1 0 0 0 0 1 0]}
    {[                1 0 0 0 0 0 0 0 0]}
digitchars = isstrprop(AllFiles,'digit')
digitchars = 4×1 cell array
    {[0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1]}
    {[                              0 1]}
    {[                    0 0 0 1 0 0 1]}
    {[                0 0 0 0 0 0 1 1 1]}
AllFiles = arrayfun(@(a,b,c)(string(a{1}(find(b{:} | c{:})))),cellstr(AllFiles),upperchars,digitchars)
AllFiles = 4×1 string array
    "P32G5"
    "P1"
    "E2G3"
    "E105"

IDK, maybe regexp will be better/easier (I've never been able to get my head wrapped around regular expressions and patterns).
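
The masking step in the last arrayfun call above can be easier to follow on a single string. The following is only a sketch of that one step, assuming the rule really is "keep uppercase letters and digits, drop everything else"; the variable names are made up for illustration.

```matlab
% Sketch only: the upper-case/digit mask applied to one already-trimmed name.
s = "Phenolic32_Group5";          % one element after the extractAfter/reverse steps
c = char(s);                      % isstrprop returns a plain logical row for a char vector

% Logical mask that is true wherever the character is an uppercase letter or a digit.
keep = isstrprop(c,'upper') | isstrprop(c,'digit');

% Logical indexing keeps only the flagged characters; convert back to a string.
short = string(c(keep))
```

Because the mask is applied by logical indexing, the find call is not strictly needed for this step.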

Gabriel Stanley on 8 Sep 2022

...I feel bad, because in the time you've developed this solution I also managed to develop the appropriate regexp for the Phenolic and Epoxy groups, and am now figuring out how to fuse the two resulting arrays.

Answer 1

Stephen23 on 8 Sep 2022

S = [...
    "textIdontCareAbout_Phenolic32_Group5_textIdontCareAbout"
    "textIdontCareAbout_P1_textIdontCareAbout"
    "textIdontCareAbout_Epx2_G3_textIdontCareAbout"
    "textIdontCareAbout_Epoxy_105_textIdontCareAbout"];
T = regexp(S,'_.+_','match','once');
T = regexprep(T,'[^A-Z\d]','')
T = 4×1 string array
    "P32G5"
    "P1"
    "E2G3"
    "E105"
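
The answer above is posted without commentary, so here is a hedged walk-through of the same two calls on a single filename. It assumes, as the example data in the question does, that the text on either side contains no underscores of its own; the variable names are illustrative only.

```matlab
% Sketch only: the same two steps, traced on one filename from the question.
name = "textIdontCareAbout_Epx2_G3_textIdontCareAbout";

% Step 1: '_.+_' is greedy, so it spans from the first underscore to the last one.
middle = regexp(name, '_.+_', 'match', 'once')

% Step 2: delete every character that is not an uppercase letter or a digit.
code = regexprep(middle, '[^A-Z\d]', '')
```

If the surrounding text can itself contain underscores, uppercase letters, or digits, the greedy first step would capture too much and a more specific pattern would be needed.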

Gabriel Stanley on 8 Sep 2022

So, as I noted in the comments (I realize I should have edited the original question to include this), "textIdontCareAbout" includes both letter/number pairings and underscores (which are used as whitespace characters). The following expression captures all the text I'm interested in:

AllFiles = [...
        "textIdontCareAbout_Phenolic32_Group5_textIdontCareAbout"
        "textIdontCareAbout_P1_textIdontCareAbout"
        "textIdontCareAbout_Epx2_G3_textIdontCareAbout"
        "textIdontCareAbout_Epoxy_105_textIdontCareAbout"];
SingleExpression = {'(E(poxy|px)?|P(hen(olic)?)?)(_)?\d{1,3}(_G(rp|roup)?(_)?\d{1,2})?'};
temp = regexpi(AllFiles,SingleExpression,'match')
temp = 4×1 cell array
    {["Phenolic32_Group5"]}
    {["P1"               ]}
    {["Epx2_G3"          ]}
    {["Epoxy_105"        ]}

However, when trying to run a second invocation of regexp on the resulting array temp, MATLAB threw the error

temp2 = regexprep(temp,'[^A-Z\d]','')
Error using regexprep
All cells must be char row vectors.

The solution to the above was to have an intermediate cellstr operation on temp:

temp2 = cellstr(temp);
temp3 = regexprep(temp2,'[^A-Z\d]','')

Stephen23 on 8 Sep 2022

Edited: Stephen23 on 8 Sep 2022

Why are you nesting this character vector in a superfluous scalar cell array?:

SingleExpression = {'(E(poxy|px)?|P(hen(olic)?)?)(_)?\d{1,3}(_G(rp|roup)?(_)?\d{1,2})?'};

"However, when trying to run a second invocation of regexp on the resulting array temp, MATLAB threw the error... The solution to the above was to have an intermediate cellstr operation on temp:"

The error occurs because you did not use the ONCE option, as shown in my answer, so your code adds an extra layer of nested cell arrays. Rather than adding extra commands (e.g. CELLSTR) the simple and efficient solution is to specify the ONCE option, just as I showed you:

temp = regexpi(AllFiles,SingleExpression,'match','once')
%                                                ^^^^^^ simply specify this
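
To make the difference concrete, here is a small sketch comparing the two call forms. The filenames and the shortened pattern are hypothetical stand-ins, not the ones from the thread, chosen only to keep the example compact.

```matlab
% Sketch only: hypothetical filenames and a simplified, illustrative pattern.
names = ["x_P1_y"; "x_Epx2_G3_y"];
expr  = '(E(poxy|px)?|P)_?\d{1,3}(_G_?\d{1,2})?';

% Without 'once': a cell array holding one string array of matches per filename.
all_matches = regexpi(names, expr, 'match');

% With 'once': a string array holding the first match for each filename.
first_match = regexpi(names, expr, 'match', 'once');

class(all_matches)
class(first_match)

% The string array can be passed straight to regexprep, no cellstr needed.
short = regexprep(first_match, '[^A-Z\d]', '')
```

The nested form is only useful when more than one match per filename is expected; here each name contains a single code, so the flat string array is the more convenient shape.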

Gabriel Stanley on 8 Sep 2022

The cell brackets {} are because I forgot I don't need them for string arrays. And I've taken your direction and added the 'once' option to the regexpi call. Thank you for your help.
