Help reducing huge time overhead when executing system() or unix() commands from a MATLAB function

I have a data processing program that goes through gigabytes of .csv text. The approach I've been taking is to use grep (Mac or Linux) to break up my massive data files into digestible files.
Here's the catch: I have to do it a few hundred times. No biggie, grep is fast, I'll knock this out in no time. A test from bash:
Sweet! 1.6s per file and I'll have this done before I finish my coffee break.
Well....
Holy Toledo, that makes a huge difference. My pre-processing is now taking multiple hours per data file. I have tried looking at MATLAB environment variables and even dipped my toe into calling through Java (check out jsystem on the File Exchange!).
Still no luck. I can have my function generate a bash script and then run the script manually, I suppose, but I need this to work for other people in the office who are less command-line savvy (that's saying something!).
Can anyone shed some light on what's going on or point me towards an improved time?
Thanks a bunch!
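% I ran this example on a 3.23 GB text file:
% 16,706,909 lines
% 3,225,789,919 characters
% Run the timed grep through unix(); the matches are redirected to output.csv
% and r holds whatever else the shell printed.
[~, r] = unix('time LC_ALL=C grep -F "My Data Tag String" "/Path/To/Giant/Data/File.csv" > output.csv')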
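One way to tell whether the time goes into spawning the process or into the command itself is to time a command that does nothing. The following is only a rough sketch for Mac or Linux; the repeat count `nCalls` is an arbitrary choice, not something from the original post.

```matlab
% Estimate the per-call cost of system() by timing a no-op command.
nCalls = 5;                        % hypothetical repeat count
elapsed = zeros(1, nCalls);
for k = 1:nCalls
    t = tic;
    [status, ~] = system('true');  % 'true' does nothing and exits with status 0
    elapsed(k) = toc(t);
end
fprintf('median system() overhead: %.3f s\n', median(elapsed));
```

If the median is a small fraction of a second, the slowdown is probably in the command being run rather than in system() itself.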

2 Comments

You could have it generate a script that had all of the commands in it to do all of the splitting, and then do a single system() -- thus getting the overhead only once.
True, the single script could be run from my function. I wasn't sure if the overhead is just on spawning each instance, or if there's something deeper at work that would put me in the same position.
Also, it hurts to not have user feedback - my current implementation has a progress bar so my users believe something is happening!
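A minimal sketch of the single-script idea, assuming Mac or Linux with bash available. The tag list, the output file naming, and the `echo` progress lines are all hypothetical. The `'-echo'` option of system() displays the command output in the Command Window, which may give users some feedback, though it is not a progress bar.

```matlab
% Write every grep command into one script, then call system() once.
tags   = {'My Data Tag String', 'Another Tag'};   % hypothetical tag list
inFile = '/Path/To/Giant/Data/File.csv';           % placeholder path from the question

scriptFile = [tempname '.sh'];
fid = fopen(scriptFile, 'w');
assert(fid ~= -1, 'Could not create %s', scriptFile);
fprintf(fid, '#!/bin/bash\n');
for k = 1:numel(tags)
    outFile = sprintf('output_%03d.csv', k);       % hypothetical naming scheme
    fprintf(fid, 'LC_ALL=C grep -F "%s" "%s" > "%s"\n', tags{k}, inFile, outFile);
    fprintf(fid, 'echo "finished %d of %d"\n', k, numel(tags));
end
fclose(fid);

% Only run the script when the input file actually exists.
if exist(inFile, 'file') == 2
    status = system(['bash "' scriptFile '"'], '-echo');
end
```

The output files land in MATLAB's current folder, since the script uses relative names.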


Accepted Answer

When I cd into a directory with a fair number of files, run
!time ls
and compare the timing with running ls in a shell cd'd to the same directory, the !time ls version shows up as slightly faster, so launching a command from MATLAB does not by itself add much overhead.
So the next thing for you to test: are you getting the same grep in each case?
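A quick way to run that test, sketched for Mac or Linux: ask the shell that MATLAB spawns which grep it resolves and what PATH it has, then compare against `which grep` and `echo $PATH` typed in a terminal.

```matlab
% Which grep does a command launched from MATLAB actually find?
[status, grepPath] = system('which grep');
grepPath = strtrim(grepPath);
fprintf('grep seen by MATLAB: %s\n', grepPath);

% The PATH that MATLAB passes on to the commands it launches.
fprintf('PATH seen by MATLAB: %s\n', getenv('PATH'));

% The version banner helps tell a GNU build from a BSD one.
[~, versionText] = system('grep --version');
disp(strtrim(versionText));
```

If the path or version differs from what the terminal reports, the two timings are not measuring the same program.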

4 Comments

Walter, I owe you a beverage of your choosing.
I added GNU grep to our Macs because it's significantly faster than the BSD version they ship with... My shell's environment variables aren't carried into the system commands, are they?
So I have to either call /usr/local/bin/grep explicitly or set my PATH.
I'll give it a whirl!
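Both options could look roughly like this. It is a sketch that assumes GNU grep really is at /usr/local/bin/grep, as in the comment above; the tag, input path, and output name are the placeholders from the question. A PATH set with setenv lasts only for the current MATLAB session.

```matlab
gnuDir = '/usr/local/bin';   % assumed location of GNU grep; adjust if it lives elsewhere

% Option 1: put gnuDir at the front of the PATH that MATLAB hands to child processes.
oldPath = getenv('PATH');
if ~ismember(gnuDir, strsplit(oldPath, pathsep))
    setenv('PATH', [gnuDir pathsep oldPath]);
end

% Option 2: name the executable explicitly so PATH does not matter.
tag     = 'My Data Tag String';
inFile  = '/Path/To/Giant/Data/File.csv';
outFile = 'output.csv';
grepExe = fullfile(gnuDir, 'grep');
cmd = sprintf('LC_ALL=C "%s" -F "%s" "%s" > "%s"', grepExe, tag, inFile, outFile);
disp(cmd);

% Only run the command when the input file actually exists.
if exist(inFile, 'file') == 2
    [status, out] = system(cmd);
end
```

Option 2 is the more explicit of the two, but it breaks on any machine where grep is installed somewhere else.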
When you start MATLAB from the icon instead of from the command line, it does not go through a login shell, so .bashrc and the like are not processed.
I have posted a few times about exactly what is processed at login time and what is processed when an interactive shell starts, and linked to how to set Mac environment variables for all processes (it involves a plist and launchctl).
Thanks, I'll dig up your posts - seems like I should read them!
Also, I just split a 3 gig file into 37 pieces in the time it took me to do 1 previously...
Man, that gnu grep
Thanks again!


Asked: 17 Nov 2017

Commented: 17 Nov 2017
