MATLAB Coder regexp Alternative
Hello,
I am attempting to use MATLAB Coder to convert a function I have for parsing large text files for relevant data. I recently posted a related question regarding the size of these files: https://www.mathworks.com/matlabcentral/answers/448915-large-text-file-datastore. Although the datastore option wasn't the best way for me to parse my data, I was able to successfully write a function to read the large (~30 GB and up) text files and reduce the data out of them.
My reason for using Coder is the hope of speeding up the function. The function reads the text file in blocks, but it must loop through each block a number of times looking for relevant data. Because of the data format and my lack of control over it, this is the only way I see to parse the file. Additionally, the function cannot be effectively vectorized. Since the function is dominated by nested for loops that cannot be vectorized, I expect codegen to give a speedup.
The problem I have now is converting the function to a form that codegen can use. All of the issues flagged so far by coder.screener( ) have been fairly easy to fix, with the exception of my usage of regexp( ). I need this function because my text file blocks contain a lot of "garbage" information and I only need to extract certain expressions from each block. For example, I might have a page of data that looks similar to this:
Garbage garbage garbage = 2345 garbage lsasdfasdf
adfasdfasdfasdffas klasdfklnfa asdfasdflkasdf lkasdf
Relevant Data
X 1 = 20.100 X 2 = 30.200 X 3 = 40.100 .....
....
X ij1 = 12.012 X ij2 = 210.20 X ijk = 1000.1
Garbage garbage garbage = 2345 garbage lsasdfasdf
adfasdfasdfasdffas klasdfklnfa asdfasdflkasdf lkasdf
My way of handling this has been to use regexp like this:
load PageData.mat
ValueExpr = '\d+=(\S|\s+)\S+\s';
DataBlocks = regexp(PageData,ValueExpr,'match');
This piece of code returns, in a cell array, all of the data in the format "X 1 = 20.100" for all of the indices from 1 to ijk. I can then use strsplit to get the ijk values and measurement values.
This is the crux of my problem. Although I am using regexp in many places in the function for other tasks like splitting, trimming, etc., the usage shown above is the backbone of the data extraction routine, and I am unsure how to reproduce this behavior with the functions that Coder accepts. My best attempt would be to use something like strfind( ) to gather all the indices of "=" and loop through these indices to get the data, but I am not sure if there is a simpler way to get this into a Coder-acceptable form.
Any ideas would be greatly appreciated.
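The strfind idea described above (locate every "=", then walk outward from it) can be sketched without any regular expression. The sketch below is written in C++, the language discussed in the answers, and the same loop structure should carry over to strfind-based MATLAB code. The names Entry and scan_equals are hypothetical, and it assumes each record is a run of digits, optional whitespace, "=", optional whitespace, and a number that strtod can read.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical record type: the integer index left of '=' and the value right of it.
struct Entry {
    long index;
    double value;
};

// Hypothetical helper: finds every '=' and keeps only those that have
// digits immediately to the left (ignoring whitespace) and a number to the right.
std::vector<Entry> scan_equals(const std::string& text) {
    std::vector<Entry> out;
    std::size_t pos = 0;
    while ((pos = text.find('=', pos)) != std::string::npos) {
        // Walk left over whitespace, then over the digits of the index.
        std::size_t left_end = pos;
        while (left_end > 0 &&
               std::isspace(static_cast<unsigned char>(text[left_end - 1]))) {
            --left_end;
        }
        std::size_t left_begin = left_end;
        while (left_begin > 0 &&
               std::isdigit(static_cast<unsigned char>(text[left_begin - 1]))) {
            --left_begin;
        }
        // Walk right: strtod skips leading whitespace and stops after the number.
        const char* start = text.c_str() + pos + 1;
        char* stop = nullptr;
        const double value = std::strtod(start, &stop);
        // Accept only if there was at least one digit on the left
        // and strtod consumed something on the right.
        if (left_begin < left_end && stop != start) {
            const long index = std::strtol(text.c_str() + left_begin, nullptr, 10);
            out.push_back({index, value});
        }
        ++pos;  // continue searching after this '='
    }
    return out;
}
```

Lines such as "Garbage garbage = 2345" are rejected because the text left of "=" does not end in digits. Whether this is faster than regexp on real data is not something this sketch establishes.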
Answers (2)
Guillaume
25 Mar 2019
Like Walter, I was going to suggest delegating to another regular expression engine. Note that in modern C++ (C++11 and later), there is no longer any need for boost::regex; it is now part of the standard library as std::regex.
However, you already have access to one or two other regular expression engines directly from Matlab.
You always have access to the java regular expression engine:
pattern = java.util.regex.Pattern.compile('\d+=(\S|\s+)\S+\s');
matcher = pattern.matcher(java.lang.String('garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage'));
matches = {};
while matcher.find
matches = [matches; char(matcher.group)];
end
celldisp(matches)
On Windows, you also have access to the .Net regular expression engine:
%in theory you should be able to create a regular expression:
%regex = System.Text.RegularExpressions.Regex('\d+=(\S|\s+)\S+\s');
%and get a matchcollection
%matchcollection = regex.Matches('garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage');
%but i get a 'no method with match signature' error that I don't understand right now
%Can use static Matches method instead:
matchcollection = System.Text.RegularExpressions.Regex.Matches('garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage', '\d+=(\S|\s+)\S+\s');
matches = arrayfun(@(m) char(matchcollection.Item(m).Value), 0:matchcollection.Count-1, 'UniformOutput', false);
celldisp(matches)
3 comments
Guillaume
25 Mar 2019
@Walter, Yes I was thinking about that on my way home, and it's probably not supported. (I don't have coder, so don't know). However, if you're going to call a C++ function, you could still delegate to .Net
@Christopher, as far as I know regular expressions are not part of C. They are part of C++ (different languages despite some similarity). Here is an example of using std::regex. Bear in mind that I wrote that code back in 2013 when C++11 was new and have hardly written any C++ since then, so there may be some more modern ways of doing it nowadays.
#include <regex>
#include <string>
int main(){
    // raw string literal, so the backslashes are not treated as C++ escape sequences
    const char* pattern = R"(\d+=(\S|\s+)\S+\s)"; //nowadays you'd probably use std::string
    const char* search = "garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage";
    const std::regex re(pattern);
    std::string matched_text;
    std::cmatch result;
    // regex_search stops at the first match
    if (std::regex_search(search, result, re)){
        matched_text = result.str(); //str() or str(0) is the full match.
    }
    //...
}
Note that each expression engine may differ slightly in how it behaves. For example, Matlab's regex is the only engine I've come across where . also matches a \n by default. Your regular expression is sufficiently basic that it should behave the same in all engines.
I'm not entirely sure what you're trying to match with your regex. It doesn't match anything in your example text. The alternation probably slows the regular expression (as it will force backtracking).
But at the end of the day, if you're parsing 50 GB of text data, it will take time regardless of which language it's in. That's a lot of text!
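Since std::regex_search only returns the first match, collecting every record in a block needs an iterator. Below is a minimal sketch using std::sregex_iterator. The pattern is an assumption, not the one from the question: it allows optional whitespace around "=" so that text like "X 1 = 20.100" matches. The function name extract_pairs is hypothetical.

```cpp
#include <exception>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (index, value) pairs found in one block of text.
// Assumed record shape: digits, optional whitespace, '=', optional whitespace,
// then a non-whitespace token that starts with a number.
std::vector<std::pair<int, double>> extract_pairs(const std::string& block) {
    static const std::regex re(R"((\d+)\s*=\s*(\S+))");
    std::vector<std::pair<int, double>> out;
    for (std::sregex_iterator it(block.begin(), block.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        try {
            // m[1] is the index, m[2] the value token.
            out.emplace_back(std::stoi(m[1].str()), std::stod(m[2].str()));
        } catch (const std::exception&) {
            // Token after '=' was not numeric (or out of range): skip it.
        }
    }
    return out;
}
```

Capturing the index and value as separate groups avoids a second splitting pass over each match. The pattern has no alternation, but no timing comparison against the original expression has been made here.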
Guillaume
27 Mar 2019
I know nothing about coder and don't have the toolbox. Reading the documentation of coder.ceval, I find it very incomplete. It mentions that you can call C and C++ functions but only shows examples of calling C functions, not C++ functions, so it's really not clear whether it supports calling pure C++.
Despite the similarity, C and C++ are two different languages. C++ functions use a different calling convention and support a lot more types as inputs and outputs than C. You can make a C++ function compatible with and callable from C by prefixing it with extern "C". However, you're then limited to C-compatible parameter and return types, which are basically just standard integer or floating point types or plain pointers (to struct, arrays, etc.).
Considering that I haven't seen a single example of C++ in the various coder documentation pages that I've looked at, I think that their constant use of "C/C++" is misleading; it looks like coder only understands C interfaces. So you would have to replace your std::vector inputs and outputs by pointers (and, looking at the doc, wrap these in coder.ref or coder.rref for inputs and coder.wref for outputs). So your function would become:
void regex_var(const char* str, const char* pattern, char** matches){ //does matlab use char or wchar?
//can use std::vector within the function but in-out must be C
//...
}
and in matlab it would be something like:
coder.ceval('regex_var', coder.rref(str), coder.rref(pattern), coder.wref(matches));
What I can't figure out from the doc is how you manage the allocation of matches. All examples show preallocation of the output in matlab, which of course doesn't work if you don't know how many matches you're going to get, or the length of each match. It looks to me that coder does not support functions that return variable size arrays.
Again, I know nothing about coder, so take all this with a grain of salt.
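One common C-side answer to the allocation question is to let the caller own the buffer and have the function report how much space it needs, in the style of snprintf. The sketch below is an assumption about how that could look, not something taken from the coder documentation: the name regex_var_flat and its parameters are hypothetical, matches are written into one flat char buffer separated by '\n', and how this maps onto coder.ceval and coder.wref is untested here.

```cpp
#include <cstring>
#include <regex>
#include <string>

// Hypothetical C-callable wrapper around std::regex.
// Writes all matches into `out`, each followed by '\n', NUL-terminated.
// Returns the number of bytes required (including the NUL), or -1 on error.
// Nothing is written when `out` is null or `out_capacity` is too small, so the
// caller can call once to size the buffer and a second time to fill it.
// Limitation: a match that itself contains '\n' makes the output ambiguous.
extern "C" int regex_var_flat(const char* str, const char* pattern,
                              char* out, int out_capacity, int* n_matches) {
    try {
        const std::regex re(pattern);
        std::string joined;
        int count = 0;
        const char* last = str + std::strlen(str);
        for (std::cregex_iterator it(str, last, re), end; it != end; ++it) {
            joined += it->str();  // full text of this match
            joined += '\n';
            ++count;
        }
        *n_matches = count;
        const int needed = static_cast<int>(joined.size()) + 1;
        if (out != nullptr && needed <= out_capacity) {
            std::memcpy(out, joined.c_str(), static_cast<std::size_t>(needed));
        }
        return needed;
    } catch (...) {
        // C++ exceptions must not cross the extern "C" boundary.
        return -1;
    }
}
```

Because only plain pointers and ints cross the boundary, std::string and std::regex stay internal to the function. Total match text cannot exceed the input length for non-overlapping matches, so a caller that prefers a single call could preallocate a buffer sized from the input instead, at the cost of extra memory.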