Extracting length information of pattern from specific string (not fixed string)

Hi guys!
I want to implement a MATLAB function that takes (string, substring) as input and returns all the data that follows each occurrence of the substring. The length of that data is not known in advance, so I need to extract it from the string itself.
Assumptions:
The length of the data that follows an occurrence of "0101" is not known in advance. I must read it from the 8 bits that immediately follow the substring (the length is always stored as 8 binary digits directly after the substring). The data length is the same at every occurrence, so all rows of the output matrix have the same number of columns, but I still have to read the length field at every occurrence.
For example:
string="0101000100001111111111100000001000010100010000111111111110000011000", and the substring is always the constant "0101".
00010000-> 16 in decimal.
The 8 bits immediately after the first "0101" are 00010000, which is 16 in decimal, so the output for this occurrence is the 16 bits that follow the length field: 1111111111100000. The second occurrence of "0101" is followed by the same length field (00010000) and by the same 16 data bits.
The output is:
output=[1111111111100000 ; 1111111111100000], where each row holds the data that follows one occurrence: the first row for the first occurrence, the second row for the second occurrence, and so on.
Another example:
String="01010000111111111111111000001000100101000011111111111111100010111111", and the substring is always the constant "0101".
00001111 -> 15 in decimal for the first occurrence of "0101".
The 8 bits immediately after the first "0101" are 00001111, which is 15 in decimal, so the output for this occurrence is the 15 bits that follow the length field: 111111111110000.
00001111 -> 15 in decimal for the second occurrence of "0101", and the 15 bits that follow its length field are
111111111110001.
So the output matrix has two rows because there are two occurrences of "0101". The number of rows equals the number of occurrences of my substring, and each row holds the data that follows one occurrence, with the length read from the 8 bits immediately after that occurrence.
The output is:
output=[111111111110000; 111111111110001], where each row holds the data that follows one occurrence: the first row for the first occurrence, the second row for the second occurrence, and so on.
I need to read the length field (the 8 bits immediately after each occurrence of "0101") at every occurrence. It should have the same value at every occurrence, but I still have to read and check it each time.
Note: there can be more than one occurrence of "0101" in my string, and I need to return the data that follows each one as a row of a matrix (first row for the first occurrence, second row for the second occurrence, and so on). Occurrences never overlap, so you can assume there is always enough data between one occurrence and the next.
The occurrences can be anywhere in the string, not necessarily at the beginning!
So the input could be string=[11111111101010000111111111111111000001000100101000011111111111111100010111111]
The function that I tried to implement in MATLAB is below (unfortunately I get wrong outputs):
function TruncateSubstringResultCheck= TruncateSyncWordResultCheck(input1,substring) %input1 is my string , my substring as I said in my case it's always "0101"
positions = strfind(input1, substring) ;
TruncatedSubstring= cell2mat(arrayfun(@(idx) input1(idx+length(substring):idx+length(substring)+N-1), positions, 'uniform', 0 ).');
for i=1:NumberOfRows
substring = TruncatedSubstring(i,:);
TruncateSubstringResultCheck(i,:)=substring;
end
Could anyone help me fix this and get the required output? Thanks for any assistance!

4 Comments

Is the substring 0101 always the start of the string, or does the location of the first instance of 0101 need to be found?
I see now that you mentioned that there could be multiple occurrences of 0101. I can adapt my solution for that case, pretty easily. I'll do that later today, if no one else has done so.
Hi, the position of "0101" could be anywhere in my input string, and it can occur more than once (occurrences of "0101" never overlap). If, for instance, there are three occurrences of "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences, and each row corresponds to one occurrence in order (the first occurrence gives the first row, the second occurrence gives the second row, and so on).
Thanks!
Because things got messy in this thread, I opened a new thread with more clarification and detail:
It would be appreciated if you can help! Thanks a lot.
I hope it is now clearer and more understandable.


Accepted Answer

Stephen23 on 21 Aug 2020
Edited: Stephen23 on 21 Aug 2020
One simple dynamic regular expression can do this quite efficiently:
>> fun = @(s)sprintf('[01]{%d}',bin2dec(s));
>> rgx = '0101([01]{8})((??@fun($1)))';
>> str = '010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110';
>> tkn = regexp(str,rgx,'tokens');
>> tkn = vertcat(tkn{:});
>> out = tkn(:,2);
>> out{:}
ans =
111111111110000
ans =
0111111111111111000001111100101000010001
Note that this returns an output following the rules that you described, and so does not match the (incorrect) examples.
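For readers who have not used dynamic expressions, the same left-to-right, non-overlapping scan can be written as an explicit loop. This is only a sketch under the rules described in the question, not the code from the answer above. The names sync, nLen and k are introduced here for illustration, and the input is assumed to contain only the characters 0 and 1.

```matlab
% Sketch: explicit-loop version of the non-overlapping scan.
str  = '010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110';
sync = '0101';   % marker substring (assumed constant)
nLen = 8;        % width of the length field in bits
out  = {};       % one cell per extracted segment
k    = 1;        % current scan position
while true
    rel = strfind(str(k:end), sync);         % markers at or after position k
    if isempty(rel), break; end
    p = k + rel(1) - 1;                      % absolute position of the next marker
    lenEnd = p + numel(sync) + nLen - 1;     % last bit of the length field
    if lenEnd > numel(str), break; end       % no room left for a length field
    n = bin2dec(str(p+numel(sync):lenEnd));  % segment length in bits
    if lenEnd + n > numel(str)
        k = p + 1;                           % segment does not fit: retry one character later
        continue
    end
    out{end+1,1} = str(lenEnd+1:lenEnd+n);   % store the segment
    k = lenEnd + n + 1;                      % skip past everything just consumed
end
```

On the example string this loop should produce the same two segments as the regexp call. When a length field asks for more bits than remain, the loop retries one character later, which mirrors what the regular expression engine does after a failed match attempt.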

9 Comments

This is a slick solution, Stephen. I had been thinking about something like this, but was concerned that if the expression 0101 appeared inside the 8-bit length-identifying string, then it would give an incorrect result.
Especially given the current confusion on valid input/output, I decided to set it on the back burner. :-)
Jimmy cho on 22 Aug 2020
Edited: Jimmy cho on 22 Aug 2020
Hi!
I edited my question again and hope it is now clearer and better explained!
Thanks for any assistance.
Jimmy cho on 22 Aug 2020
Edited: Jimmy cho on 22 Aug 2020
Hi! It seems to work fine for me, but I have a problem: I want to save the output (every ans printed for each occurrence) in a matrix where each row stores the data that follows one occurrence of my substring. That is, the first row of the output matrix holds the data that follows the first occurrence, the second row holds the data that follows the second occurrence, and so on.
Note that the number of columns is the same for every occurrence (the data that follows each occurrence has the same length, but I still need to read the length at every occurrence, as I explained above).
How can I implement that in matlab according to your solution here? thanks alot.
And what your write isn't good and there's Error/Bug that mentioned here before, if the expression 0101 appeared inside the 8-bit length-identifying string, then it would give an incorrect result!
the 8bit length identifying string is immediately follows the substring "0101" occurrence/appearance and maybe there's "0101000010100000" , here after occurrence of 0101 , the 8bit length identifying is always immediately follows "0101" substring after each occurrence of my substring - so here in my case it's 0000101 so here the length is 5 in decimal so I need to return 5bit following data that follows the 8bit length identifying, so I need to return the "00000"
@Stephen Cobeldick
Moreover, about what your code solution here for those two rows:
>> fun = @(s)sprintf('[01]{%d}',bin2dec(s));
>> rgx = '0101([01]{8})((??@fun($1)))';
To check that I understand your code: if my length field is stored as two bytes, that is 16 bits rather than the 8 bits I mentioned before, I would change those two lines to the following (because in this case I need to read the 16 bits that follow my substring):
>> fun = @(s)sprintf('[01]{%d}',bin2dec(s));
>> rgx = '0101([01]{16})((??@fun($1)))';
right?! this is just verification if I understand you code correctly!
thanks alot.
"How can I implement that in matlab according to your solution here? thanks alot."
It is already implemented: the cell array out contains the binary data. Which is why I named it out.
"And what your write isn't good..."
Thank you, it is nice to be appreciated sometimes.
"... and there's Error/Bug that mentioned here before, if the expression 0101 appeared inside the 8-bit length-identifying string, then it would give an incorrect result!"
No, that is not how regular expression parsers work. Although the cyclist (incorrectly) mentioned this earlier, in fact the regular expression parser does not overlap applications of the regular expression. Once those characters have been consumed by the regular expression parser they will not be matched by any following application of the regular expression. So once the regular expression parser has consumed '0101' and the eight following digits then those consumed digits will NOT be compared against the start of the next application of the regular expression.
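A small illustration of the difference between overlapping and non-overlapping matching, using a made-up test string (not one from this thread). strfind reports every starting position, while regexp resumes searching only after the end of the previous match:

```matlab
str = '0101010100';                              % hypothetical input with overlapping 0101 patterns
allStarts   = strfind(str, '0101');              % every position where 0101 begins: 1 3 5
regexStarts = regexp(str, '0101', 'start');      % non-overlapping matches only: 1 5
disp(allStarts)
disp(regexStarts)
```

The match starting at position 3 is skipped by regexp because its characters were already consumed by the match starting at position 1.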
'...the 8bit length identifying string is immediately follows the substring "0101" ... "0101000010100000"... it's 0000101 so here the length is 5 in decimal ... return the "00000" '
It does return that. But first we need to fix your example 8-bit substring, which actually only has 7 bits:
0101000010100000 % your example string
0000101 % your example length substring (7 bits)
00000 % your example output substring (5 bits)
We can fix your example to check that your (and the cyclist's) understanding of regular expressions is incorrect. Here I added an extra zero so that the length substring is actually 8 bits long:
>> str = '01010000010100000'; % my string (with actual 8-bit length substring).
>> fun = @(s)sprintf('[01]{%d}',bin2dec(s));
>> rgx = '0101([01]{8})((??@fun($1)))';
>> tkn = regexp(str,rgx,'tokens');
>> tkn = vertcat(tkn{:});
>> out = tkn(:,2);
>> out{:}
ans =
00000
Can you give an example of a string which demonstrates that supposed "bug"? (I doubt that you can).
"this is just verification if I understand you code correctly!"
Yes, your modification is correct for 16-bit length substrings.
Jimmy cho on 22 Aug 2020
Edited: Jimmy cho on 22 Aug 2020
Thanks for your answer, really appreciated!
As I said, the substring is not always at the beginning of my string, and it can occur more than once.
An example string: str='00010100000101000001010000010100000'
Here you are right, it gives me the correct answer, but it does not return a matrix in which each row holds the data selected by the 8-bit length field. Here the output should be a matrix:
output=[00000;00000] , where the first row holds the data that follows the first occurrence (the number of rows equals the number of occurrences of my substring "0101")
and the second row holds the data that follows the second occurrence.
There are two rows because there are just two occurrences of my substring "0101".
Another question @Stephen Cobeldick: if my input str holds binary integer values rather than a string, what exactly should I change in your solution code?
I will explain what my issue, my input(str) isn't string it's a binary array integers ..and the output is a matrix of binary integers array and not strings!, I guess it's not a big problem but how could I fix your code implementation to work with input str=[00010100000101000001010000010100000] array of binary integers and not strings.., the same as the output isn't a string matrix..it's a binary integers array matrix (I mean every row of the output matrix is an array of binary integers values and not a string as what you implemented before)
@Stephen, you misinterpreted my earlier comment. I meant that I had not pursued this type of solution because I did not know how the regular expression would work in this case. (But now I do, thanks to you!) I didn't mean that I thought it would not work.
Really glad to see this works as intended. It's certainly the more elegant algorithm.
"output=[00000;00000]"
If the input is a character vector and the data subvectors can have different lengths then it is not possible to concatenate them into one character matrix. You could pad them to have the same length and then concatenate them together. Or convert to string, in which case you will get a vector of strings (where each element is a scalar string with a different number of characters).
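The options described above can be sketched as follows. The cell array contents here are example values, and padMat is only a demonstration of how char pads rows of unequal length:

```matlab
out = {'111111111110000'; '111111111110001'};  % example cell array of equal-length bit strings
chrMat = vertcat(out{:});          % 2x15 character matrix; errors if the lengths differ
numMat = chrMat - '0';             % 2x15 numeric matrix of 0/1 values
padMat = char({'101'; '11111'});   % unequal lengths: char pads shorter rows with trailing spaces
```

vertcat is the stricter choice because it fails loudly when a segment has an unexpected length, which doubles as the "same length at every occurrence" check the question asks for.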
Converting to numeric is possible, but note that apart from some coincidental visual similarity, the decimal number 101 is totally unrelated to the binary number 101.
"I will explain what my issue, my input(str) isn't string it's a binary array integers.... str=[00010100000101000001010000010100000]"
Your example cannot be stored as one integer by any standard integer class supported by MATLAB. Perhaps you actually meant that each of those digits is a separate element of an integer array, e.g.:
vec = [0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0];
in which case you can trivially convert those integers to character:
str = sprintf('%d',vec);
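A short round-trip sketch, assuming the input really is a row vector of 0/1 values. The vector below is a shortened, made-up example (two leading zeros, the marker 0101, the length field 00000101, then five data bits), and weights is a helper name introduced here:

```matlab
vec  = [0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0];  % 00 + 0101 + 00000101 + 00000
str  = sprintf('%d', vec);        % numeric 0/1 vector -> character vector
bits = str - '0';                 % character vector -> numeric 0/1 vector
n    = bin2dec(str(7:14));        % length field read from the characters
weights = 2.^(7:-1:0);            % place values of an 8-bit field, most significant first
n2   = sum(vec(7:14) .* weights); % same length computed directly from the numbers
```

Converting to characters once at the start and back to numbers at the end keeps the extraction code itself unchanged.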
Yup, Understood.
appreciated!


More Answers (3)

the cyclist on 20 Aug 2020
Edited: the cyclist on 20 Aug 2020
If 0101 is always at the beginning of the string, then
% Example input
str ="0101000011111111111111100000101010";
% The 8 digits after 0101 define the length.
% In other words, the 5th to 12th digits.
L = bin2dec(extractBetween(str,5,12));
% The L digits after 0101 and the next 8, are the output string.
% In other words, start from the 13th digit, and get L digits.
output = extractBetween(str,13,12+L);
or if you actually have a character array :
% Example input
str ='0101000011111111111111100000101010';
% The 8 digits after 0101 define the length.
% In other words, the 5th to 12th digits.
L = bin2dec(str(5:12));
% The L digits after 0101 and the next 8, are the output string.
% In other words, start from the 13th digit, and get L digits.
output = str(13:(12+L));
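If the marker is not at the start, the fixed indices can be replaced by offsets from the first match. This is a minimal sketch for a single occurrence only. It assumes the marker is present and that the string is long enough, and the test string is built by hand for illustration:

```matlab
payload = [repmat('1',1,11) '0000'];                    % 15 data bits
str = ['111' '0101' dec2bin(15,8) payload '0101010'];   % marker is not at position 1
p = strfind(str, '0101');
p = p(1);                        % position of the first occurrence
L = bin2dec(str(p+4 : p+11));    % the 8 bits after the marker give the length
output = str(p+12 : p+11+L);     % the L bits after the length field
```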

1 Comment

Jimmy cho on 20 Aug 2020
Edited: Jimmy cho on 20 Aug 2020
Hi, the position of "0101" could be anywhere in my input string, and it can occur more than once (occurrences of "0101" never overlap). If, for instance, there are three occurrences of "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences, and each row corresponds to one occurrence in order (the first occurrence gives the first row, the second occurrence gives the second row, and so on).
Thanks!
It doesn't give me a matrix when there is more than one occurrence of my substring in my string, as I explained in my comment above.


% Sample input
str ="0101000000011010100000001101010000000110101000000011010100000001101010000000110101000000111110101";
% Initialize with first index of 0101, and string length
idx0101 = regexp(str,"0101","once");
strL = strlength(str);
% Initialize string array for output
segments = strings(0);
% Loop over string, while it is long enough to hold 0101 and the length
% identifier segment
while strL >= idx0101 + 11
    % Find the segment length
    segmentL = bin2dec(extractBetween(str,idx0101+4,idx0101+11));
    % If the string is long enough to contain a segment of that length
    % after this 0101 and its 8-bit length identifier, extract it
    if strL >= idx0101+11+segmentL
        % Pull segment of the correct length
        thisSegment = extractBetween(str,idx0101+12,idx0101+11+segmentL);
        % Append the segment to the array
        segments = [segments; thisSegment];
        % Remove the segment and its identifiers
        str = extractAfter(str,idx0101+11+segmentL);
        % Find the length of the shortened string, and first location of
        % "0101", so that we can start over
        strL = strlength(str);
        idx0101 = regexp(str,"0101","once");
    else
        break % Break out of the loop if the string is not long enough to have a new segment
    end
end

11 Comments

Jimmy cho on 20 Aug 2020
Edited: Jimmy cho on 20 Aug 2020
Hi, the position of "0101" could be anywhere in my input string, and it can occur more than once (occurrences of "0101" never overlap). If, for instance, there are three occurrences of "0101" in my string, my output is a matrix with 3 rows: the number of rows equals the number of occurrences, and each row corresponds to one occurrence in order (the first occurrence gives the first row, the second occurrence gives the second row, and so on).
Thanks!
This solution isn't working properly, as I said above, and it looks rather long and not compact, no? That is my guess, but I don't really know.
the cyclist on 20 Aug 2020
Edited: the cyclist on 20 Aug 2020
Most of the complexity comes from the fact that one needs to check when you're finished, which can happen in two ways:
  • You find 0101, but there are not 8 bits immediately following
  • You find 0101, and there are 8 bits following it, but the segment length specified is longer than the rest of the string
If you can guarantee that that never happens in the string, then all that checking could be removed.
Also, in all three of your examples, you seem to have inaccurate statements:
First example
You say that the output should be length 16, but the output you showed here (111111111100000) is length 15.
Second example
You say that the output should be length 15, but the output you showed here (11111111110000) is length 14.
Third example
You state "00010000 -> 10 in decimal", but actually that is 16 again.
So, I went by what you described, which seemed consistent, and didn't pay much attention to your example output.
Can you please double-check your inputs and outputs, or come up with some more (especially one that requires more than one output), and then we can see where the discrepancy is.
Hi! Sorry, I just missed some bits while writing. I updated my thread above!
For the first example, I updated the output in my thread above to match what you stated here.
For the second example, I updated the output in my thread above to match what you stated here.
For the third example: "00001000" -> 10 in decimal.
the cyclist on 20 Aug 2020
Edited: the cyclist on 20 Aug 2020
First example
The output in the edited question is still only length 15. (What I posted was your incorrect output. Don't copy mine.)
Second example
The output in the edited question is still only length 14. (What I posted was your incorrect output. Don't copy mine.)
Third example
Binary 00001000 is actually decimal 8, not decimal 10. Your example output is presumably wrong because of that.
Again, if you can put correct input/output, we can check the algorithm.
Another example. I mean that my function returns a matrix in which each row holds the data that follows one occurrence of my substring, with the length read at that occurrence. That is:
Let's assume I have this example:
string="010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110";
substring="0101"
The output stores, for each occurrence, the data that follows it in the row with the same index:
for the first occurrence of the substring, the data that follows it is stored in the first row of my matrix;
for the second occurrence of my substring, the data that follows it is stored in the second row of my matrix;
...etc!
So the output is a matrix of the data that follows each occurrence of my substring, stored as rows in order:
output=[11111111111000000; 111111111110000 ; 1111111111] ;
Just do
str2double(segments)
to convert the output of my algorithm to the numeric array you want.
You will need to be careful if you have very long segments, as you could lose numerical precision.
It is not clear how you generate the expected output from the sample string. These are the data from your example:
010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110
0101 0101 0101
00001111 00001111 00001000
11111111111000000 111111111110000 1111111111
But in fact there is an earlier location that matches second 0101:
010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110
0101 0101
00001111 00101000
11111111111000000 0111111111111111000001111100101000010001
What is the rule for ignoring that earlier match location?
We can also see that the first expected output string ends with a zero, although there is actually a one in that location in the input string. I presume that is simply a typographical mistake. Your first expected output also contains 17 characters, but 00001111 is actually 15, not 17. Once we correct this as well, we get this:
010100001111111111111110000010101001010000111111111111111000001111100101000010001111111111100000111110
0101 0101
00001111 00101000
111111111110000 0111111111111111000001111100101000010001
My solution here gives the output that you specified for the input/output combinations you specified in the other location, if you do
str2double(segments)
as I suggested.
Jimmy cho on 22 Aug 2020
Edited: Jimmy cho on 22 Aug 2020
Hi @the cyclist!
I edited my question again and hope it is now clearer and better explained!
@the cyclist
Anyway, I appreciate your effort, mate!


This is an answer to the follow_up question, which was closed when I tried to submit.
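%%
chr = '01010000111111111111111000001000100101000011111111111111100010111111';
sbs = '0101';
%%
pos = strfind( chr, sbs );
out = cell( numel(pos), 1 );
isValid = false( numel(pos), 1 );   % marks occurrences with a complete record
%%
for pp = 1 : numel(pos)
    ix1 = pos(pp) + 4;
    ix2 = ix1 + 8 - 1;
    if ix2 > numel(chr), continue, end   % the 8-bit length field is cut off
    len = bin2dec( chr(ix1:ix2) );       % read the length before it is used
    if ix2+len <= numel(chr)
        out{pp,1} = chr(ix2+1:ix2+len);
        isValid(pp) = true;
    end
end
out = out(isValid);   % drop incomplete records after the loop, not during it
%%
output = string( out );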
This script prints
output =
2×1 string array
"111111111110000"
"111111111110001"
And the script outputs the same result for
chr = 'xxxxxxxxxxxx01010000111111111111111000001000100101000011111111111111100010111111';
and for
chr = '11111111101010000111111111111111000001000100101000011111111111111100010111111';
There is at least one problem with the script: it does not handle the case where the distance between substrings is less than 12+1 positions.

3 Comments

Jimmy cho on 22 Aug 2020
Edited: Jimmy cho on 22 Aug 2020
Hi!
I edited my question again and hope it is now clearer and better explained!
Thanks for any assistance.
Thanks for instructing me to re-edit my question here. I hope it is now more understandable and better explained.
It doesn't work for me because the input is actually an array of binary integers, and the substring is also an array of binary integers, [0 1 0 1]. Sorry for not mentioning that in my thread!
So your solution doesn't work for me, possibly because the inputs are arrays of binary integers:
%%
chr = [01010000111111111111111000001000100101000011111111111111100010111111];
sbs = [0101];
%%
pos = strfind( chr, sbs );
out = cell( numel(pos), 1 );
%%
for pp = 1 : numel(pos)
ix1 = pos(pp) + 4;
ix2 = ix1 + 8 - 1;
if ix2+len <= numel(chr)
len = bin2dec( chr(ix1:ix2) );
out{pp,1} = chr(ix2+1:ix2+len);
else
out(pp) = [];
end
end
%%
output = string( out );


Asked: 20 Aug 2020

Commented: 23 Aug 2020
