Extracting length information of pattern from specific string (not fixed string)

Question

0 votes

Hi guys!

I want to implement a MATLAB function that takes (string, substring) as input and outputs all the data that follows each occurrence of the substring. The length of that data is not known in advance, which means I need to extract the length of the following data from the string itself.

Assumptions:

The length of the data that follows an occurrence of "0101" is not known in advance. I must read it from the 8 bits that immediately follow the substring: the length is always an 8-bit binary number, and it always comes directly after the substring. The data length is the same at every occurrence, so every row of the output matrix has the same number of columns, but I still have to read the length value at each occurrence of my substring "0101".

for example:

string="0101000100001111111111100000001000010100010000111111111110000011000". The substring is always constant: "0101".

00010000-> 16 in decimal.
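
As a quick check of the conversion above, here is a minimal sketch using MATLAB's `bin2dec`. The variable names are illustrative only, and the weighted-sum line is an equivalent that assumes the bits are held as a numeric 0/1 vector with the most significant bit first:

```matlab
lenField = '00010000';                  % the 8 bits that follow "0101", as text
n = bin2dec(lenField);                  % binary text -> decimal number

% Equivalent for a numeric 0/1 vector (assumption: most significant bit first)
bits = lenField - '0';                  % char digits -> numeric 0/1 values
nFromBits = sum(bits .* 2.^(7:-1:0));   % weight each bit by its power of two
```

Both `n` and `nFromBits` give the number of data bits to read after the length field.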

So the output is the 16 bits of data that follow "00010000", which are 1111111111100000. How do I know the length of the following data? It is given in the string itself: the 8 bits immediately after "0101" hold the length of the data that comes after those 8 bits. Here those 8 bits are 00010000, which is 16 in decimal, so I take the 16 bits that follow the length field. As a result the output here is 1111111111100000.

the output is:

output=[1111111111100000 ; 1111111111100000]. Each row holds the data that follows one occurrence: the first row is for the first occurrence, the second row for the second occurrence, and so on.

Another example:

String="01010000111111111111111000001000100101000011111111111111100010111111". The substring is always constant: "0101".

00001111 -> 15 in decimal for the first occurrence of "0101"

So the output is the 15 bits of data that follow "00001111", which are 111111111110000. As before, the length is given in the string itself: the 8 bits immediately after "0101" are 00001111, which is 15 in decimal, so I take the 15 bits that come after the length field. As a result the output here is 111111111110000 (the 15 bits of data immediately following the length field of the first occurrence of 0101).

00001111 -> 15 in decimal for the second occurrence of "0101", and the 15 bits that follow the 8-bit length field are

111111111110001 (the 15 bits of data immediately following the length field of the second occurrence of 0101)

So the output matrix has two rows because there are two occurrences of "0101". The number of rows equals the number of occurrences of my substring 0101, and each row holds the data that immediately follows the length field, with the length read from the 8 bits right after that occurrence.

the output is:

output=[111111111110000; 111111111110001]. Each row holds the data that follows one occurrence: the first row is for the first occurrence, the second row for the second occurrence, and so on.

I need to check the length field (the 8 bits immediately following each occurrence of my substring "0101"). It should be the same length at every occurrence, but I must still read it at each occurrence, even though the value must be the same.

Note: there can be more than one occurrence of my substring "0101" in my string. I need to return all the following data, as explained above, one row of the matrix per occurrence (the first row holds the data that follows the first occurrence of my substring, the second row the data that follows the second occurrence, and so on). Occurrences cannot overlap, so assume there is always enough data between one occurrence and the next.

The substring can occur anywhere, not only at the beginning of my string!

So the input could be string=[11111111101010000111111111111111000001000100101000011111111111111100010111111]

The function that I tried to implement in MATLAB is below (unfortunately I get wrong outputs):

function TruncateSubstringResultCheck= TruncateSyncWordResultCheck(input1,substring)  %input1 is my string , my substring as I said in my case it's always "0101"
positions = strfind(input1, substring) ;                                            
TruncatedSubstring= cell2mat(arrayfun(@(idx) input1(idx+length(substring):idx+length(substring)+N-1), positions, 'uniform', 0 ).');                                            
for i=1:NumberOfRows
    substring = TruncatedSubstring(i,:);                                                                   
   TruncateSubstringResultCheck(i,:)=substring;
end

Could anyone help me fix that and get the required output? Thanks for any assistance!

4 comments

Jimmy cho on 20 Aug 2020

Hi, the position of "0101" could be anywhere in my input string, and it could occur more than once (there are no overlaps between occurrences of "0101"). If, for instance, there are three occurrences of my substring "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences of my substring, and each row corresponds to one occurrence (the first occurrence gives the first row, the second occurrence the second row, and so on).

Thanks!

Jimmy cho on 21 Aug 2020

Because things got messed up here in my thread, I opened a new thread with more clarification and detail:

https://www.mathworks.com/matlabcentral/answers/582914-extracting-specific-string-according-to-variable-length-length-is-changeable

It would be appreciated if you can help! Thanks a lot.

I hope it's now clearer and more understandable.

Answer 1

Stephen23 on 21 Aug 2020

Edited: Stephen23 on 21 Aug 2020

1 vote

One simple dynamic regular expression can do this quite efficiently:

>> fun = @(s)sprintf('[01]{%d}',bin2dec(s));
>> rgx = '0101([01]{8})((??@fun($1)))';
>> str = '010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110';
>> tkn = regexp(str,rgx,'tokens');
>> tkn = vertcat(tkn{:});
>> out = tkn(:,2);
>> out{:}
ans =
111111111110000
ans =
0111111111111111000001111100101000010001

Note that this returns an output following the rules that you described, and so does not match the (incorrect) examples.

9 comments

Jimmy cho on 22 Aug 2020

Edited: Jimmy cho on 22 Aug 2020

Hi! It seems to work fine for me, but I have a problem: I want to save the output (every ans produced for each occurrence) in a matrix in which each row stores the data following one occurrence of my substring. That is, the first row of the output matrix holds the data following the first occurrence, the second row holds the data following the second occurrence, and so on.

Note that the number of columns of the matrix is the same for each occurrence of my substring (all the following data have the same length, but I still need to check the length at every occurrence, as I explained above in my thread).

How can I implement that in MATLAB based on your solution? Thanks a lot.

Also, there is the error/bug that was mentioned here before: if the expression 0101 appears inside the 8-bit length-identifying string, it gives an incorrect result!

The 8-bit length field always immediately follows each occurrence of the substring "0101". Suppose the string is "01010000010100000": after the occurrence of 0101, the 8-bit length field is 00000101, so the length is 5 in decimal and I need to return the 5 bits of data that follow the length field, which are "00000".

Jimmy cho on 22 Aug 2020

Edited: Jimmy cho on 22 Aug 2020

Thanks for your answer, really appreciated!

As I said, the substring is not always at the beginning of my string, and it can occur more than once.

Take for example the string str='00010100000101000001010000010100000'.

Here you're right, it gives me the correct answer, but it doesn't return a matrix in which every row holds the data that follows one occurrence, according to the 8-bit length field. The output should be a matrix:

output=[00000;00000]. The first row represents the data following the first occurrence (the number of rows of the output matrix equals the number of occurrences of my substring "0101").

The second row represents the data following the second occurrence.

There are two rows because there are just two occurrences of my substring "0101".

Another question, @Stephen Cobeldick: if my input str held binary integer values instead of a string, what exactly should I change in your solution code?

I will explain my issue: my input (str) isn't a string, it's an array of binary integers, and the output is a matrix of binary integers, not strings! I guess it's not a big problem, but how could I change your code to work with the input str=[00010100000101000001010000010100000], an array of binary integers? Likewise, the output isn't a string matrix: every row of the output matrix is an array of binary integer values, not a string as in your implementation.

Stephen23 on 22 Aug 2020

"output=[00000;00000]"

If the input is a character vector and the data subvectors can have different lengths then it is not possible to concatenate them into one character matrix. You could pad them to have the same length and then concatenate them together. Or convert to string, in which case you will get a vector of strings (where each element is a scalar string with a different number of characters).
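
The padding option described above might look like the following sketch. The cell array `out` is made-up sample data standing in for the extracted segments, and the other variable names are illustrative. `char` pads shorter rows with trailing spaces; plain `vertcat` only works when every segment has the same length.

```matlab
out = {'111111111110000'; '00000'};   % hypothetical segments of different lengths

% Pad to a common width and stack into one char matrix.
% char() right-pads the shorter rows with spaces.
padded = char(out{:});                % 2-by-15 char matrix

% If every segment has the same length, plain concatenation is enough.
lens = cellfun(@numel, out);          % length of each segment
if all(lens == lens(1))
    mat = vertcat(out{:});            % one row per segment, no padding needed
else
    mat = padded;                     % fall back to the padded matrix
end
```

The `all(lens == lens(1))` test doubles as the "check that every occurrence reports the same length" step asked for in the question. In releases that have the string type, `string(out)` is the other option mentioned above and keeps each segment at its own length.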

Converting to numeric is possible, but note that apart from some coincidental visual similarity, the decimal number 101 is totally unrelated to the binary number 101.

"I will explain what my issue, my input(str) isn't string it's a binary array integers.... str=[00010100000101000001010000010100000]"

Your example cannot be stored as one integer by any standard integer class supported by MATLAB. Perhaps you actually meant that each of those digits is a separate element of an integer array, e.g.:

vec = [0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0];

in which case you can trivially convert those integers to character:

str = sprintf('%d',vec);
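
Going back from text to numbers after the extraction could be sketched as follows. The bit vector and the index range are made up for illustration (a "0101" marker, a length field of 3, then 3 data bits); subtracting the character '0' turns each digit character into the number 0 or 1.

```matlab
vec = [0,1,0,1, 0,0,0,0,0,0,1,1, 1,0,1];   % made-up bits: marker, length 3, data
str = sprintf('%d', vec);                  % numeric bits -> char vector
n   = bin2dec(str(5:12));                  % decode the 8-bit length field
seg = str(13:12+n);                        % extracted data, still text
segBits = seg - '0';                       % char digits -> numeric row vector
```

Stacking several such `segBits` rows with `vertcat` gives a numeric matrix, provided all rows have the same length.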

Jimmy cho on 23 Aug 2020

Yup, understood.

appreciated!

Answer 2

the cyclist on 20 Aug 2020

Edited: the cyclist on 20 Aug 2020

1 vote

If 0101 is always at the beginning of the string, then

% Example input
str ="0101000011111111111111100000101010";
% The 8 digits after 0101 define the length.
% In other words, the 5th to 12th digits.
L = bin2dec(extractBetween(str,5,12));
% The L digits after 0101 and the next 8, are the output string.
% In other words, start from the 13th digit, and get L digits.
output = extractBetween(str,13,12+L);

or if you actually have a character array :

% Example input
str ='0101000011111111111111100000101010';
% The 8 digits after 0101 define the length.
% In other words, the 5th to 12th digits.
L = bin2dec(str(5:12));
% The L digits after 0101 and the next 8, are the output string.
% In other words, start from the 13th digit, and get L digits.
output = str(13:(12+L));

1 comment

Jimmy cho on 20 Aug 2020

Edited: Jimmy cho on 20 Aug 2020

Hi, the position of "0101" could be anywhere in my input string, and it could occur more than once (there are no overlaps between occurrences of "0101"). If, for instance, there are three occurrences of my substring "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences of my substring, and each row corresponds to one occurrence (the first occurrence gives the first row, the second occurrence the second row, and so on).

Thanks!

It doesn't give me a matrix if there is more than one occurrence of my substring in my string, as I explained in my comment above.

Answer 3

the cyclist on 20 Aug 2020

1 vote

% Sample input
str ="0101000000011010100000001101010000000110101000000011010100000001101010000000110101000000111110101";
% Initialize with first index of 0101, and string length
idx0101 = regexp(str,"0101","once");
strL = strlength(str);
% Initialize string array for output
segments = strings(0);
% Loop over string, while it is long enough to hold 0101 and the length
% identifier segment
while strL >= idx0101 + 11 
    
    % Find the segment length
    segmentL = bin2dec(extractBetween(str,idx0101+4,idx0101+11));
    
    % If the string is long enough to contain a string of that length,
    % extract it
    if strL >= idx0101+11+segmentL
        
        % Pull segment of the correct length
        thisSegment = extractBetween(str,idx0101+12,idx0101+11+segmentL);
        
        % Append the segment to the array
        segments = [segments; thisSegment];
        
        % Remove the segment and its identifiers
        str = extractAfter(str,idx0101+11+segmentL);
        
        % Find the length of the shortened string, and first location of
        % "0101", so that we can start over
        strL = strlength(str);
        idx0101 = regexp(str,"0101","once");
    
    else
        
        break % Break out of the loop if the string is not long enough to have a new segment
        
    end
    
end

11 comments

the cyclist on 20 Aug 2020

Edited: the cyclist on 20 Aug 2020

Most of the complexity comes from the fact that one needs to check when you're finished, which can happen in two ways:

You find 0101, but there are not 8 bits immediately following
You find 0101, and there are 8 bits following it, but the segment length specified is longer than the rest of the string

If you can guarantee that that never happens in the string, then all that checking could be removed.
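
The two stopping conditions above can also be sketched with a moving index instead of shortening the string. This is only an illustration, not the code from the answer: it assumes a character vector input, the sample string is made up, and the names `pos`, `hdr`, and `frames` are hypothetical. Because the search resumes after each extracted segment, a "0101" that happens to lie inside a length field or inside the data is not treated as a new occurrence.

```matlab
% Made-up input: prefix, then two frames (lengths 3 and 2), then trailing bits
str  = ['11', '0101', '00000011', '101', '0101', '00000010', '00', '11'];
sync = '0101';

pos    = 1;      % where the next search starts
frames = {};     % one extracted segment per row
while true
    k = strfind(str(pos:end), sync);
    if isempty(k)
        break                              % no further sync word
    end
    hdr = pos + k(1) - 1 + numel(sync);    % first bit of the length field
    if hdr + 7 > numel(str)
        break                              % fewer than 8 bits follow the sync word
    end
    n = bin2dec(str(hdr:hdr+7));           % decode the 8-bit length
    if hdr + 7 + n > numel(str)
        break                              % declared length runs past the end
    end
    frames{end+1,1} = str(hdr+8 : hdr+7+n);
    pos = hdr + 8 + n;                     % resume after this segment
end
```

In this sample the data "101" followed by the next "0101" contains an extra, overlapping "0101", which a plain `strfind` over the whole string would also report; the moving index skips it. Stopping at the first incomplete frame is a design choice; continuing the search one position later would be an alternative.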

Also, in all three of your examples, you seem to have inaccurate statements:

First example

You say that the output should be length 16, but the output you showed here (111111111100000) is length 15.

Second example

You say that the output should be length 15, but the output you showed here (11111111110000) is length 14.

Third example

You state "00010000 -> 10 in decimal", but actually that is 16 again.

So, I went by what you described, which seemed consistent, and didn't pay much attention to your example output.

Can you please double-check your inputs and outputs, or come up with some more (especially one that requires more than one output), and then we can see where the discrepancy is.

Stephen23 on 21 Aug 2020

Edited: Stephen23 on 21 Aug 2020

It is not clear how you generate the expected output from the sample string. These are the data from your example:

010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110
0101                              0101                              0101
    00001111                          00001111                          00001000
            11111111111000000                 111111111110000                   1111111111

But in fact there is an earlier location that matches second 0101:

010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110
0101                         0101
    00001111                     00101000
            11111111111000000            0111111111111111000001111100101000010001

What is the rule for ignoring that earlier match location?

We can also see that the first expected output string ends with a zero, although there is actually a one in that location in the input string. I presume that is simply a typographical mistake. Your first expected output also contains 17 characters, but 00001111 is actually 15, not 17. Once we correct this as well, we get this:

010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110
0101                         0101
    00001111                     00101000
            111111111110000              0111111111111111000001111100101000010001

Jimmy cho on 22 Aug 2020

Edited: Jimmy cho on 22 Aug 2020

Hi @the cyclist!

I edited my question again and hope it's now clearer and better explained!

Jimmy cho on 22 Aug 2020

@the cyclist

Anyway I appreciate your effort mate!

Answer 4

per isakson on 22 Aug 2020

1 vote

This is an answer to the follow-up question, which was closed when I tried to submit.

%%
chr = '01010000111111111111111000001000100101000011111111111111100010111111';
sbs = '0101';
%%
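pos  = strfind( chr, sbs );
out  = cell( numel(pos), 1 );
keep = false( numel(pos), 1 );      % marks occurrences with complete data
%%
for pp = 1 : numel(pos)
    ix1 = pos(pp) + numel(sbs);     % first bit of the 8-bit length field
    ix2 = ix1 + 8 - 1;              % last bit of the length field
    if ix2 <= numel(chr)
        len = bin2dec( chr(ix1:ix2) );   % read the length before using it
        if ix2+len <= numel(chr)
            out{pp,1} = chr(ix2+1:ix2+len);
            keep(pp)  = true;
        end
    end
end
out = out(keep);                    % drop occurrences with truncated data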
%%
output = string( out );

This script prints

output = 
  2×1 string array
    "111111111110000"
    "111111111110001"

And the script outputs the same result for

chr = 'xxxxxxxxxxxx01010000111111111111111000001000100101000011111111111111100010111111';

and for

chr = '11111111101010000111111111111111000001000100101000011111111111111100010111111';

There is at least one problem with the script: it does not handle the case where the distance between substrings is less than 12+1 positions.

3 comments

Jimmy cho on 22 Aug 2020

Hi @Walter Roberson

Thanks for instructing me to re-edit my question here. I hope it's now more understandable and better explained.

Jimmy cho on 22 Aug 2020

Edited: Jimmy cho on 22 Aug 2020

Hi @per isakson!

It doesn't work for me because the input is actually an array of binary integers, and the substring is also an array of binary integers (sorry for not mentioning that in my thread!). So my string is an array of binary integers, and my substring is [0 1 0 1].

So your solution doesn't work for me, possibly because the inputs are arrays of binary integers:

%%
chr = [01010000111111111111111000001000100101000011111111111111100010111111];
sbs = [0101];
%%
pos = strfind( chr, sbs );
out = cell( numel(pos), 1 );
%%
for pp = 1 : numel(pos)
    ix1 = pos(pp) + 4;
    ix2 = ix1 + 8 - 1;
    if ix2+len <= numel(chr)
        len = bin2dec( chr(ix1:ix2) );
        out{pp,1} = chr(ix2+1:ix2+len);
    else
        out(pp) = [];
    end
end
%%
output = string( out );
