How, if possible, do I limit the number of times REGEXP searches for a specific pattern?

Question

2 votes

I’m using a regular expression to search blocks of text that look like the following:

MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
Type: 1 Track ID: 12345 Time Tag: 58573.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
MSN_RAM (0:32) Observation #30 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 2
Type: 1 Track ID: 12345 Time Tag: 58583.00000000
   Band ID: 1   AD ID:   31 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
Type: 1 Track ID: 12345 Time Tag: 58585.00000000
   Band ID: 1   AD ID:   32 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0

Note: There is no 2nd MSN_BER data block.

I’m using the following search pattern and REGEXP function to extract the time tag and AD ID values:

exp = '([\d\.]+)\s+Band[^A]+?AD ID:\s+(\d+).';
tokens3 = regexp(bufferSplit{BlockId}, exp, 'tokens');

This results in: tokens3 = {1x2 cell} {1x2 cell} {1x2 cell},

where the time tag and AD ID are contained in the cells for each occurrence in the block of text.

>> tokens3{1,1}

ans = '58573.00000000' '21'

>> tokens3{1,2}

ans = '58583.00000000' '31'

>> tokens3{1,3}

ans = '58585.00000000' '32'

What I’m attempting to accomplish is to limit the search. Specifically, I want to limit the number of times REGEXP searches for the time tag and AD ID values, based on the fact that there is no 2nd MSN_BER data block. I know the 'once' option returns only the first match found. However, there could be multiple occurrences of the AD ID and its associated time tag.

The result of this would be: tokens3 = {1x2 cell}

>> tokens3{1,1}

ans = '58573.00000000' '21'

Can this be accomplished using the REGEXP function?

3 comments

Brad on 15 Nov 2013

Cedric, it is based on the file contents.

The reason I want to limit the search is that the text files containing these data blocks are not consistent (sometimes the MSN_BER blocks of data are missing).

Normally, the data blocks populate the text file in this order:

MSN_BER

MSN_RAM

Type: - this block of data could occur between 1 and several hundred times

MSN_BER

MSN_RAM

Type: - this block of data could occur between 1 and several hundred times

. . . .

. . .

So what I have to do is account for the possibility of missing MSN_BER blocks by not parsing out and saving the associated Time Tag and AD ID values (within the Type: blocks) for each missing MSN_BER. I'm finding this quite tedious due to the fact that the Type: blocks of data can occur a varying number of times.

I took Walter's advice and rebuilt my expression to include a look ahead assertion:

exp = '([\d\.]+)\s+Band[^A]+?AD ID:\s+(\d+).+\w*(?=MSN)';

It worked great for the first test, where only a single block of Type: data is present. However, when multiple blocks of Type: data are present, it only accounts for the first occurrence.

Cedric on 16 Nov 2013

So you have a situation like the following?

 MSN_BER
 ...
 MSN_RAM
 ...
 Type: - this block of data could occur between 1 and several hundred times
 MSN_RAM ** No MSN_BER, so Type entries should be discarded.
 ...
 Type: - this block of data could occur between 1 and several hundred times
 MSN_BER
 ...
 MSN_RAM
 ...
 Type: - this block of data could occur between 1 and several hundred times

If so, what do you want to achieve? Is it to get a stat on the time of all Type entries that belong to any MSN_BER, a stat per MSN_BER, or something else?


Answer 1

Cedric on 16 Nov 2013

Edited: Cedric on 17 Nov 2013

1 vote

I'll answer assuming that my last comment under your question is correct. It is nice to implement complex regular expressions for learning, but in practice one often gets better results by splitting a one-shot complex call/pattern into a series of simpler calls/patterns. Here is an example, using the following content:

 MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
 MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
 Type: 1 Track ID: 12345 Time Tag: 58573.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58574.00000000
   Band ID: 1   AD ID:   21 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 MSN_RAM (0:32) Observation #30 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 2
 Type: 1 Track ID: 12345 Time Tag: 58583.00000000
   Band ID: 1   AD ID:   31 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58585.00000000
   Band ID: 1   AD ID:   32 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 MSN_BER (0:31) Observation #1 Rx'd at:  (58570.000) Msg. Time:  (58568.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  State Time:            12:00:00.000   (58571.000)
  State Position:       -1500.0000, -5000.0000, 4100.0000
 MSN_RAM (0:32) Observation #20 Rx'd at:  (58569.000) Msg. Time:  (58569.000)
  Forward to IMU: true   Rcv Date: 2010121   Synch: f0f0   Rep Mode: Replay_Mode
  Fmt: 10 (AIRBORN__ARRAY_LOT)  Length: 5678   Remote Num: 1   Number of Obsevations: 1
 Type: 1 Track ID: 12345 Time Tag: 58578.00000000
   Band ID: 1   AD ID:   41 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0
 Type: 1 Track ID: 12345 Time Tag: 58579.00000000
   Band ID: 1   AD ID:   41 Scan ID: 0  LRT/HRT: 1  Valid Flag: 0

which is made of two MSN_BER/MSN_RAM blocks framing an MSN_RAM only block. I assume that you want to get all AD IDs and time tags of MSN_BER/MSN_RAM blocks.

The first step is to read the file and get valid MSN_BER/MSN_RAM blocks:

 content = fileread( 'bradFile.txt' ) ;
 BER_blocks = regexp( content, 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*', 'match' ) ;

Running this produces:

 >> BER_blocks
 BER_blocks = 
    [1x766 char]    [1x762 char]

If you display these two blocks, you'll see that the first doesn't include the MSN_RAM-only block. The first part of the pattern, MSN_BER.+?RAM, lazily matches from MSN_BER up to the first RAM that follows it. The second part, (?:[^R]+|R(?!AM))*, then matches any run of characters other than 'R', or an 'R' that is not followed by 'AM', so the match stops just before the next 'RAM'. This is one (not too inefficient) way to exclude a given string from the match.
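The exclusion idiom described above can be tried on a toy string. This is only a minimal sketch: the string `txt` and the short markers `BER` and `RAM` are made up for illustration and are not Brad's data.

```matlab
% Hypothetical toy input: a BER/RAM pair, a RAM-only record, then another pair.
txt = 'BER a RAM b R c RAM d BER e RAM f';

% 'BER.+?RAM' lazily reaches the first RAM after a BER. The group then
% consumes either runs of non-R characters or a single R that is not
% followed by AM, so it halts right before the next 'RAM'.
blocks = regexp(txt, 'BER.+?RAM(?:[^R]+|R(?!AM))*', 'match');

% Print each match between brackets so that trailing blanks are visible.
for k = 1:numel(blocks)
    fprintf('[%s]\n', blocks{k});
end
```

The lone `R` in the first record stays inside the first match because it is not followed by `AM`, while the text of the RAM-only record (` d `) ends up in neither match.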

The second step is to extract AD IDs and time tags from each block.

 data = cell( size( BER_blocks )) ;
 for bId = 1 : numel( BER_blocks )
    tokens = regexp( BER_blocks{bId}, 'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', ...
                     'tokens' ) ;
    data{bId} = reshape( str2double( [tokens{:}] ), 2, [] ).' ;
 end

Based on the above content, this leads to the following data cell array (each cell contains the time tags and AD IDs of one MSN_BER/MSN_RAM block):

 >> celldisp( data )
 data{1} =
        58573          21
        58574          21
 data{2} =
        58578          41
        58579          41
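The reshape line in the loop can be followed step by step on a hand-built `tokens` value. The sketch below assumes the shape that `regexp(..., 'tokens')` returns for two matches with two groups each; the values are copied from the first block, and the intermediate names `flat`, `nums`, and `pairs` are illustrative.

```matlab
% One inner cell per match, one string per capture group.
tokens = {{'58573.00000000', '21'}, {'58574.00000000', '21'}};

% Concatenating the inner cells gives a flat 1x4 cell of strings:
% time tag, AD ID, time tag, AD ID.
flat = [tokens{:}];

% str2double converts a cell array of strings element by element.
nums = str2double(flat);

% Filling a 2-row matrix column by column puts each match in its own
% column; the transpose then gives one row per match.
pairs = reshape(nums, 2, []).';
disp(pairs)
```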

You can then concatenate these cells' content if you want to have one big array instead of one array per block:

 >> data = vertcat( data{:} )
 data =
       58573          21
       58574          21
       58578          41
       58579          41

Let me know if it's not what you wanted.
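One caveat that seems worth checking against real files: the exclusion group only stops at the next `RAM`, so when two MSN_BER/MSN_RAM pairs follow each other directly (the normal order Brad describes), the first match runs on through the second MSN_BER header and the second pair is never matched on its own. A possible alternative is to split the text at the headers and keep only the MSN_RAM segments that directly follow an MSN_BER. The sketch below assumes that every record starts with the literal `MSN_BER` or `MSN_RAM` and that those strings appear nowhere else; the shortened sample string and the names `hdr`, `seg`, and `keep` are hypothetical.

```matlab
% Hypothetical shortened content: a pair, a RAM-only record, then two
% pairs that follow each other directly.
content = ['MSN_BER x MSN_RAM Time Tag: 1.5 AD ID: 21 ' ...
           'MSN_RAM Time Tag: 2.5 AD ID: 31 ' ...
           'MSN_BER x MSN_RAM Time Tag: 3.5 AD ID: 41 ' ...
           'MSN_BER x MSN_RAM Time Tag: 4.5 AD ID: 51 '];

% Header kinds ('BER' or 'RAM') and the text pieces between headers.
[hdr, seg] = regexp(content, 'MSN_(BER|RAM)', 'tokens', 'split');
hdr = [hdr{:}];      % flatten to a 1xN cell of strings
seg = seg(2:end);    % drop the piece before the first header

% Keep a RAM segment only if the header just before it is a BER.
keep = false(size(hdr));
keep(2:end) = strcmp(hdr(2:end), 'RAM') & strcmp(hdr(1:end-1), 'BER');

% Same token pattern as in the second step, applied to the kept text.
tokens = regexp(strjoin(seg(keep), ' '), ...
                'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', 'tokens');
data = reshape(str2double([tokens{:}]), 2, []).';
disp(data)
```

This variant returns one combined array rather than one cell per block; looping over `seg(keep)` instead of joining it would preserve the per-block grouping.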

2 comments

Brad on 18 Nov 2013

Cedric, this appears to work just fine! In hindsight I should have been paying closer attention to my data. I've got a long ways to go with these regular expressions. Thanks for taking a look at this.

Cedric on 18 Nov 2013

You're welcome. And we actually all have a long way to go with these regular expressions, so I sympathize!


Answer 2

Walter Roberson on 12 Nov 2013

2 votes

After a pattern, perhaps enclosed in () or (?:), you can put {minimum,maximum} counts. For example

'(?:\d\w){3,7}'

would match 3, 4, 5, 6, or 7 occurrences of \d\w repeated.
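A small check of the bounded quantifier on a made-up string (not Brad's data). Note that `{min,max}` limits how often the group repeats inside one match; it does not limit how many matches `regexp` returns. Capping the number of returned matches by indexing the result, as in the last line, is an assumption about what might be wanted, not something stated in this answer.

```matlab
% Hypothetical input: words built from 2, 3 and 5 digit-letter pairs.
str = '1a2b 1a2b3c 1a2b3c4d5e';

% The group must repeat at least 3 and at most 4 times in a row.
m = regexp(str, '(?:\d\w){3,4}', 'match');

% The first word has only 2 pairs and is skipped; the last word is cut
% off after 4 pairs, and the single leftover pair is too short to match.
disp(m)

% To cap the number of matches instead, index the result.
firstOnly = m(1:min(1, numel(m)));
```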

7 comments

Walter Roberson on 14 Nov 2013

Sorry, the look-ahead should be ?= rather than ?:

'((?:\w+=).*?)(?=MSN_RAM)'

?: is an example of a Grouping Operator and ?= is an example of a Lookaround Assertion

The \w+= was just a sample pattern I tossed in for illustration; it matches a "word" followed by an equals sign.

The structure would be

(pattern_to_repeat)*?(?=pattern_to_stop_before)
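The sample pattern from this comment can be tried on a made-up string. This is a minimal sketch; `k1=`, `k2=`, and `k3=` are placeholder fields, not fields from Brad's files.

```matlab
% Hypothetical input with a stop marker in the middle.
str = 'k1=a k2=b MSN_RAM k3=c';

% Lazy '.*?' grows one character at a time until the lookahead sees
% 'MSN_RAM'. The lookahead asserts the marker without consuming it.
[tok, e] = regexp(str, '((?:\w+=).*?)(?=MSN_RAM)', 'tokens', 'end', 'once');

fprintf('[%s]\n', tok{1});        % captured text, marker excluded
fprintf('[%s]\n', str(e+1:e+7));  % the marker sits right after the match
```

Because the marker is not consumed, a following search can still start at `MSN_RAM`.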

Brad on 15 Nov 2013

Walter,

So I got a search pattern that works for these blocks of data when there is a single occurrence of the Type: block (which contains the AD ID and Time Tag values).

exp = '([\d\.]+)\s+Band[^A]+?AD ID:\s+(\d+).+\?*(?=MSN)';

However, when there are multiple Type: blocks, I get only the first occurrence. I'm sure it's due to the look-ahead (?=MSN).

So I've attempted to add a repeating pattern to the expression. So far, the only luck I've had is with the following:

exp = '(([\d\.]+)\s+Band[^A]+?AD ID:\s+(\d+).)+\?*((?:\w*).*?)(?=MSN)';

This found only the first occurrence of the Type: blocks and produced NaNs instead of true Time Tag and AD ID values.

I am completely stumped as to how to get the repeating pattern. Maybe this is outside the functionality of REGEXP.
