How to capture tokens using regular expressions?

Question

Patrick Mboma on 16 Sep 2015

Direct link to this question

https://la.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions

Commented: Cedric on 19 Sep 2015

Dear all, I would like to capture two parts of each string in a sequence of strings. I would like to call the first part "main" and the second part "digits". The strings follow a distinct pattern: each contains either exactly ONE underscore or a pair of parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parentheses (digits). As an example, the typical exercise will be of the form

 expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
 out=regexp(expression,pattern,'names')

The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be

main='abcd' and digits='1'.

What I am missing is the right "pattern". Any suggestions?

5 comments

Cedric on 17 Sep 2015

Edited: Cedric on 17 Sep 2015

Dear Patrick,

In summary, for extracting and validating digits and decimal points, I would write a pattern like

'(.*?)[\(_]([\d\.]*)'

which explicitly requires the second part to be zero or more (*) characters from the set ([...]) of digits (\d) or decimal points (\.). If, however, I wanted to leave validation to STR2DOUBLE, I would extract whatever is in parentheses or after the underscore:

'(.*?)[\(_]([^\)]*)'

which translates to zero or more (*) characters that are not in the set ([^...]) containing the literal closing parenthesis. Another way is given by Benjamin, who adds an optional closing parenthesis.
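To make the difference between the two patterns concrete, here is a minimal sketch that rewrites them with named tokens ("main" and "digits", as requested in the question). The named-token form, the variable names, and the two extra test strings 'x_1.5' and 'y(abc)' are assumptions added for illustration; they are not part of Cedric's comment.

```matlab
% Sketch only: Cedric's two patterns with named tokens added (assumption).
expression = {'abcd_1', 'ghsa(22)', 'x_1.5', 'y(abc)'};

% Strict variant: the second token may only hold digits and decimal points.
strictPattern = '(?<main>.*?)[\(_](?<digits>[\d\.]*)';
% Loose variant: the second token is anything up to a closing parenthesis.
loosePattern  = '(?<main>.*?)[\(_](?<digits>[^\)]*)';

% With a cell array input, each output is a cell array of structs.
strictOut = regexp(expression, strictPattern, 'names', 'once');
looseOut  = regexp(expression, loosePattern,  'names', 'once');

% Leave validation to str2double: non-numeric content becomes NaN.
looseStruct = [looseOut{:}];
values = str2double({looseStruct.digits});
```

With the strict pattern, 'y(abc)' still matches but yields an empty "digits" field, because the set is allowed to match zero characters. With the loose pattern, the same string yields 'abc', which str2double then converts to NaN.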

I also asked about how these strings are defined initially, because the context is important. If you are dealing with a reasonable number of cells, performing pattern matching on a cell array will be efficient enough. If, on the contrary, you have e.g. a 1GB file of entries to process, you may be much more efficient working on it "manually". To illustrate, say the file contains

 name1_45 
 name2(45)
 name2b_32
 name2c(84)
 ..

then you could load it as a char array, replace all '_', '(', ')', newlines, and carriage returns with white spaces, and extract names and contents in one shot with SSCANF or TEXTSCAN:

 % - Dummy file content.
 content = sprintf( 'name1_45\nname2(45)\nname2b_32\nname2c(84)\n' ) ;
 % - Flag elements to replace.
 doReplace = content == '_' | content == '(' | content == ')' | content == 10 ;
 % - Replace with space.
 content(doReplace) = ' ' ;
 % - Parse.
 parsed = textscan( content, '%s %f' ) ;

(10 is the ASCII code of the newline character \n; the code should also handle 13, the carriage return. It may be possible to make this even more efficient using BSXFUN.) With that we get

 >> parsed
 parsed = 
    {4x1 cell}    [4x1 double]
 >> parsed{1}
 ans = 
    'name1'
    'name2'
    'name2b'
    'name2c'
 >> parsed{2}
 ans =
    45
    45
    32
    84
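The carriage-return case mentioned above could be handled along these lines. This is a sketch under the assumption that the file uses Windows-style \r\n line endings and that names never contain underscores, parentheses, or spaces; the variable names separators, names, and values are illustrative.

```matlab
% Sketch only: dummy content with \r\n line endings (assumption).
content = sprintf('name1_45\r\nname2(45)\r\nname2b_32\r\nname2c(84)\r\n');

% All characters that should become white space, including
% newline (ASCII 10) and carriage return (ASCII 13).
separators = ['_()' char(10) char(13)];

% Flag and replace in one pass.
doReplace = ismember(content, separators);
content(doReplace) = ' ';

% Parse the white-space delimited name/number pairs.
parsed = textscan(content, '%s %f');
names  = parsed{1};
values = parsed{2};
```

Using ismember against a list of separators avoids chaining one comparison per character, which keeps the code short when more separators need to be added.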

Patrick Mboma on 19 Sep 2015

Thanks a lot Cedric!!!

Cedric on 19 Sep 2015

My pleasure!

Answer 1

Benjamin Kraus on 16 Sep 2015

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192653

expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');

The pattern breaks down like this:

(?<main>[a-zA-Z]+) - A token named "main" that matches one or more letters.
(?:[_\(]) - A non-capturing group that matches either an underscore or "(".
(?<digits>[0-9]+) - A token named "digits" that matches one or more digits.
\)? - An optional ")" character at the end.

The 'once' option returns only the first match in each input string. I think in this case you can leave it out.
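For reference, here is a short sketch of how the result might be read back. It assumes the closing parenthesis in the pattern is escaped as \)? and that every input string matches; the variable names first, results, mains, and digits are illustrative and not part of the answer.

```matlab
% Sketch only: reading the named tokens back out of the result.
expression = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression, pattern, 'once', 'names');

% out is a cell array with one struct per input string.
first = out{1};            % first.main and first.digits hold the tokens

% If every string matched, the structs can be concatenated
% into a single struct array for easier access.
results = [out{:}];
mains   = {results.main};
digits  = str2double({results.digits});
```

If some strings do not match, the corresponding cells are empty and are silently skipped by the concatenation, so the positions in results no longer line up with expression.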

1 comment

Patrick Mboma on 17 Sep 2015

Dear Benjamin,

Thanks for your input. Your solution would work, but it would probably need to be refined because the first part, main, may also include some digits. For instance,

whatever345whatever_100

would also be something I would like to capture. It is the second part that would only include digits.

A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
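One way to express that rule as a single pattern is sketched below. The pattern is an assumption written for illustration, not something proposed in the thread: it takes "main" to be everything that is neither an underscore nor an opening parenthesis, and it anchors the match to the whole string.

```matlab
% Sketch only: "main" is everything before the first '_' or '(',
% "digits" is the run of digits that follows (hypothetical pattern).
pattern = '^(?<main>[^_\(]+)[_\(](?<digits>\d+)\)?$';

% A single char input returns one struct rather than a cell array.
tok  = regexp('whatever345whatever_100', pattern, 'names', 'once');
tok2 = regexp('fadae(8)', pattern, 'names', 'once');
```

Because "main" is defined by what it excludes rather than by a list of allowed letters, digits inside the first part are accepted, while the second part is still restricted to digits.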

Answer 2

Kirby Fears on 16 Sep 2015

Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/243437-how-to-capture-tokens-using-regular-expressions#answer_192648

This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.

ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;

1 comment

Patrick Mboma on 17 Sep 2015

Dear Kirby,

There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.

In my current solution for instance, I first use regular expressions to transform all the inputs into the same format

whatever_45

then I look for the underscore, etc. But this entails several lines of code.

Thanks for your input!
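The two-step approach described in this comment is not shown in the thread. A guess at what it might look like, assuming the normalization is done with regexprep and the split with regexp, is sketched here; the patterns and variable names are hypothetical.

```matlab
% Sketch only: one possible reading of the two-step approach (assumption).
ex = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};

% Step 1: rewrite 'name(NN)' as 'name_NN' so all entries share one format.
normalized = regexprep(ex, '\((\d+)\)$', '_$1');

% Step 2: split each normalized string at the underscore.
parts = regexp(normalized, '_', 'split');
ex_main  = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
ex_digit = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```

This takes more lines than a single pattern with named tokens, which is the drawback the comment points out.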
