MATLAB Answers

Speeding up code: pre-allocation, vectorization, parfor, spmd....

Asked by Matlab2010 on 6 Feb 2013
Latest activity: edited by Chad Greene on 11 Jul 2019
I have written a large and very tricky bit of code. It's processing a data set of 5 million values. In outline, the code goes like this:
1. Outer parfor loop (1: 500K).
2. Next loop (1: ~100)
3. Test lots of conditions
4. Inner for loop (1:K). Assign values to a growing cell array, where K is its current length. Each cell contains a struct, which in turn contains cell arrays (it's a really high-dimensional data set!).
The problem is that it currently takes 6 seconds to carry out one run of the outer loop. I need to dramatically speed it up. (24 hours run time would be ok. 24 mins would be better :)).
I have used the profiler extensively, and other than a warning telling me to pre-allocate, it all looks OK.
Stuff I am NOT currently doing:
1. pre-allocate the cell object. The reason is that I would need to search through the object to find which values are active or not each time I accessed the object. I assume this would take more time than would be saved by pre-allocation.
2. Vectorization. This is what I normally use when possible. However, this is such a complex bit of code, with many loops inside loops, that I don't even know where to start. Any hints?
3. I have the parallel toolbox, though only one 64 bit machine with enough RAM to load the dataset. Should I be using this? I have not done so before.
I am using Win7 on a quad-core machine with 16GB.
Any sensible comments welcome. thank you.

  2 Comments

  • Over-allocating is usually much cheaper than letting an array grow repeatedly. Either allocate too many elements initially, or let the array grow in chunks of 1000 elements on demand to reduce the effects.
  • Try to vectorize the innermost loop only. But in most cases I've seen here in the forum, eliminating repeated calculations has been a great advantage already.
  • Find the bottlenecks and use PARFOR to use all cores of your machine.
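A minimal sketch of the first point, assuming each result is a small struct stored in a cell array. The names (`results`, `nUsed`, `chunk`) and the test condition are hypothetical stand-ins, not taken from the original code. A counter records how many cells are filled, so nothing has to be searched to find the active entries.

```matlab
% Grow a cell array in chunks instead of one element at a time.
chunk   = 1000;                 % hypothetical chunk size
results = cell(chunk, 1);       % initial pre-allocation
nUsed   = 0;                    % number of cells actually filled

for k = 1:5000
    if mod(k, 2) == 0           % stand-in for the real conditions
        if nUsed == numel(results)
            results{end + chunk} = [];   % extend by one whole chunk
        end
        nUsed = nUsed + 1;
        % Double braces so the field holds a cell array in a 1-by-1 struct.
        results{nUsed} = struct('index', k, 'values', {{k, 2*k}});
    end
end

results = results(1:nUsed);     % trim the unused tail once, at the end
```

The active entries are always `results(1:nUsed)`, and the array is resized only a handful of times rather than once per element.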
Quick update.
The loop time has decreased from 6s to 0.05s after pre-allocating the complex cell array/structure object.
Thank you for your comments so far. I will update in much more detail next week.

5 Answers

Answer by Jonathan Sullivan on 6 Feb 2013

It's really hard to say what exactly you can do without knowing the nature of what is inside those for-loops.
Some ideas come to mind:
  • Eliminate redundant calculations
  • Vectorize any operations that you can
  • Avoid using things like repmat and use things like bsxfun, where possible
But post your code. It's really hard to say what's applicable without seeing it.
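As an illustration of the repmat/bsxfun point, here is a small sketch that subtracts the column means from a matrix both ways. The data and variable names are made up for the example.

```matlab
% Subtract the column means from every row of a matrix.
A  = magic(4);                      % small example data
mu = mean(A, 1);                    % 1-by-4 row vector of column means

% repmat first builds a full-size temporary copy of mu.
centredRepmat = A - repmat(mu, size(A, 1), 1);

% bsxfun expands mu along the first dimension without that temporary.
centredBsxfun = bsxfun(@minus, A, mu);

sameResult = isequal(centredRepmat, centredBsxfun);
```

From R2016b onward, implicit expansion lets you write `A - mu` directly, with the same result.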

  1 Comment

+1 for "Eliminate redundant calculations". "Vectorize any operations", however, might be counterproductive when the required temporary arrays are more expensive than the overhead of running loops.
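A sketch of what eliminating redundant calculations can look like in a loop. The quantities here (`x`, `w`, `scale`) are hypothetical; the point is only that anything that does not depend on the loop index can be computed once.

```matlab
% Hoist loop-invariant work out of the loop.
x = linspace(0, 1, 1000);
w = [0.2 0.5 0.3];                  % hypothetical weights
n = 200;

% Version 1: sum(x.^2) and norm(w) are recomputed on every pass.
y1 = zeros(n, 1);
for k = 1:n
    y1(k) = sum(x .^ 2) * k / norm(w);
end

% Version 2: compute the invariant factor once, before the loop.
scale = sum(x .^ 2) / norm(w);
y2 = zeros(n, 1);
for k = 1:n
    y2(k) = scale * k;
end

maxRelDiff = max(abs(y1 - y2) ./ abs(y1));   % agreement up to rounding
```

Both versions give the same values up to floating-point rounding, but the second does the expensive part once instead of `n` times.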


Answer by Jason Ross on 6 Feb 2013
Edited by Jason Ross on 6 Feb 2013

Check using the Resource Monitor to see if you are swapping to disk while you are running. Given that you say you have enough RAM to load the data set in memory, I'm wondering if you also have enough RAM to deal with everything else that's going on. If that's the case, get more RAM.
It might also be useful to look at CPU utilization rates to see if they are pegged. When you say you have four cores, are those physical compute cores or "hyper-threaded" ones?
As for the parallel toolbox, if you only have one machine available, and you are taking up all the RAM with your existing program, then parallelization isn't likely to generate a speedup. But if you look at your RAM utilization and CPU utilization and could fit multiple copies of your program on the machine, it might help -- but if you can't, you'll likely end up going slower.
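For reference, a minimal sketch of a parfor loop, assuming Parallel Computing Toolbox is installed. The loop body is a stand-in and all names are hypothetical. Variables such as `data` that are read inside the loop without being indexed by the loop variable are sent to every worker, which is one reason the memory question above matters.

```matlab
% Outer loop as parfor with a pre-allocated, sliced output variable.
nOuter = 200;                        % hypothetical; the question has ~500K
data   = rand(1000, 1);              % broadcast variable: sent to every worker
out    = zeros(nOuter, 1);           % sliced output, one element per iteration

parfor i = 1:nOuter
    acc = 0;                         % temporary, local to each iteration
    for j = 1:100
        acc = acc + data(j) * i;     % stand-in for the real inner work
    end
    out(i) = acc;                    % each iteration writes only its own slice
end
```

Iterations must be independent of one another for this to be valid, and the speed-up depends on how much work each iteration does relative to the cost of moving data to the workers.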
If you are doing a lot of disk I/O, an SSD will trump a "traditional" hard drive for performance and access time.
But the above is best considered only once you know your program is as tight as it can be. Sometimes throwing money at a problem works, but not always.


Answer by Dan K on 6 Feb 2013
Edited by Dan K on 6 Feb 2013

Make sure that it is a function and not a script... Huge speed difference there. When you pre-allocate, make sure you're allocating enough for the eventual usage. I've seen cases where one pre-allocates, fills it up, then adds a little more to the end.
A few more thoughts...
If you are making many calls to a simple subroutine, consider putting its code inline rather than calling it over and over.
If there is a particular computation that is really the bottleneck, you could consider mex-ing it.
hope it helps.
Dan


Answer by Jan on 6 Feb 2013

Concentrate on optimizing the bottlenecks only. If you spend hours improving code that occupies 2% of the total processing time, even a speed-up by a factor of 1000 makes the program only about 2% faster. So use the profiler, or better, some tic/toc calls, to locate the bottlenecks first.
Unfortunately the profiler disables the JIT acceleration, because the JIT can change the processing order of lines, while the profiler must measure the lines in the original order.
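A small sketch of timing sections by hand with tic/toc. The two stages are arbitrary stand-ins and the variable names are hypothetical. Each `tic` returns its own timer handle, so the stages are measured independently.

```matlab
% Time two stages of a computation separately with tic/toc.
n = 2e5;
x = rand(n, 1);

t1 = tic;                            % start a timer for stage 1
s  = cumsum(x);                      % stand-in for the first stage
elapsedStage1 = toc(t1);

t2 = tic;                            % independent timer for stage 2
m  = sort(x);                        % stand-in for the second stage
elapsedStage2 = toc(t2);

fprintf('stage 1: %.4f s, stage 2: %.4f s\n', elapsedStage1, elapsedStage2);
```

Single measurements are noisy, so it usually helps to repeat each stage a few times and compare the shares of the total rather than the absolute numbers.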

  2 Comments

Really? Thank you Jan, that is something I didn't know! Do you know any way to get an accurate measure of the impact of JIT on code?
feature('JIT', 'off')
feature('accel', 'off')
Then TIC/TOC the function again. Re-enabling is straightforward.
It should be explained at a very prominent location, that the tool to measure the performance influences the efficiency of the code execution substantially. This is a massive design error of the profiler.


Answer by Sean de Wolski on 7 Feb 2013

I'll also throw in the use of MATLAB Coder to generate MEX files from various pieces of code. Although the mileage will vary, you can sometimes see a pretty good speed-up with the MEXed version.
