Computer to boost MATLAB operation
Victor R
on 4 Feb 2021
Commented: Paul Hoffrichter
on 9 Feb 2021
Hi all!
I am thinking about buying a new PC for my MATLAB work, and I was wondering whether there are any recommended CPUs or specific architectures that would reduce calculation time. Until now I have been buying consumer-level i9 processors, but the computational load of my functions is increasing dramatically, and I wonder whether there is a better solution.
PS: I always use the CPU, as my functions are often iterative, so transferring those operations to the GPU is not feasible.
Thanks in advance!
EDIT: I will add some information as suggested by @Jason Ross
My budget is approximately 2k, but if it will improve performance I can save up and reach 5-7k. I would like to know whether one 5k computer will outperform five 1k computers.
In my case, I have to use nested loops with matrix operations (calculating adjacency matrices, and graph parameters on those matrices).
I use the Parallel Computing Toolbox in the last loop.
Right now I have 9 consumer-level computers (each ~1500€) that I use simultaneously, but I feel that I have wasted my money and that a better solution could be found that outperforms the 9 PCs together.
Also, due to restrictions on my budget (the way I am allowed to spend it), I cannot pay for computing time on cloud services.
5 comments
Walter Roberson
on 5 Feb 2021
If I recall correctly, Google does not operate most of its servers on high end computers. Instead it operates on less expensive lower-value computers but lots of them, having invested a lot of effort into fault-tolerant systems that can automatically yank failing systems offline. Tens of their computers fail every day, but they designed with that in mind. If 3 out of 10 systems fail today, then you still get progress from seven out of the ten; if you had instead gone for two "5 times as good" computers then when one fails, you have lost half your capacity.
Some of the really big computer challenges have been dealt with by operating tens of thousands of home computers... on screensavers.
When you can get lots of computers together, the limits start to be communication, and finding a way to partition the computations into chunks that are at most a couple of days on lower-end computers "after-hours" but meaningful for fast computers. See BOINC and SETI@HOME and various distributed protein folding challenges...
Accepted Answer
Paul Hoffrichter
on 7 Feb 2021
Edited: Paul Hoffrichter
on 7 Feb 2021
As @Jan pointed out, it is very important that you run the profiler to find out where your time is being spent and focus on those areas, and that you remove recursion if it is a major CPU hog. Additionally, if you are doing I/O in the middle of a loop, pay special attention to it so the core is not left sitting in a wait state.
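A minimal profiler sketch (runOneSubject is a hypothetical stand-in for your own entry point, not a function from the original post):
profile on
runOneSubject();     % hypothetical entry point that processes a single subject
profile viewer       % opens the report; sort by self-time to find the hot spots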
"Tip: Always run the outermost loop in parallel, because you reduce parallel overhead."
>> I have 180 subjects on a NAS, each PC computes 20
OK, I think I now understand your 9 (= 180/20) computers better. They are solving independent problems, where independence is defined by subject at the top level. Your 20 subjects per PC should be highly parallelizable, since there is no interaction or memory sharing among the subjects. I guess that after the 9 computers complete their tasks, you bring the data together.
>> Separating the computations in this way makes the use of Parallel Server unnecessary, doesn't it?
These 9 computers are not talking to each other while the 180 subjects are being processed, right? If so, then you do not need Parallel Server. MATLAB Parallel Server is to computer clusters (or cloud clusters) what the Parallel Computing Toolbox is to CPU or GPU cores. That is, both make your programming job easier if you follow some rules (albeit sometimes tricky rules that are not easy to diagnose, as most of us have already experienced; if you have a maintenance contract with MathWorks, they can help with screen sharing).
If you purchase 9 more computers, you should roughly halve that part of your workflow time, since each computer only needs to process 10 subjects instead of 20. Of course, this assumes little contention on the NAS.
If you purchase CPUs with more cores, that could also help, provided you increase memory proportionally. But, surprisingly, our 32-core server outperformed our 70+ core server; the likely reason was that the latter uses a NUMA architecture.
As an aside, you should know that using parfor can produce slightly different results when you use single or double precision. Even a single built-in MATLAB function can return different results. I demonstrated this to MathWorks with a short script; they said it is not a bug and there will be no fix. I admit that the problem is not easy to solve, but it is annoying, and I consider it a bug. (It is somewhat analogous to Intel saying that their floating-point operation was slightly off but should not concern most users because the errors were very small. Their stock went down 5% the next day, and then they said they would fix the problem.) (If you use only integer or fixed-point arithmetic, your results should be identical under parfor.)
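A small sketch of the effect, assuming a parallel pool is available; the exact discrepancy varies by machine and run, and may be zero for some inputs. Summing the same values in a different order changes the rounding in single precision:
rng(0);
x = rand(1, 1e6, 'single');

sSerial = single(0);
for k = 1:numel(x)
    sSerial = sSerial + x(k);      % fixed left-to-right accumulation order
end

sPar = single(0);
parfor k = 1:numel(x)
    sPar = sPar + x(k);            % parfor reduction: accumulation order is not guaranteed
end

fprintf('serial - parfor = %g\n', sSerial - sPar);   % often a small nonzero value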
>> I use the parfor in the last one (nn = Q) otherwise, I get an error.
It is important that you fix this error. Please review parfor variable classification to better understand the errors you get when you use parfor at a higher level. Each core can then handle one subject nicely in its own for-loop.
From this link is a notable quote:
“If you run into variable classification problems, consider these approaches before you resort to the more difficult method of converting the body of a parfor-loop into a function.”
It was not difficult for me to do this, and maybe it will not be so difficult for you. Give this a try.
parfor ii = 1:length(subject)
    processSubject( subject(ii) );
end  % END parfor
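A hypothetical skeleton for processSubject (the helpers computeAdjacency and computeGraphMetrics are placeholders for your own code, not functions from the original post); passing everything in through the argument means parfor has nothing to classify except the sliced input:
function processSubject(subj)
% One independent subject per call: load, compute, save, return nothing.
    A = computeAdjacency(subj.data);                      % placeholder for your adjacency code
    metrics = computeGraphMetrics(A);                     % placeholder for your graph parameters
    save(fullfile(subj.outDir, 'metrics.mat'), 'metrics');
end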
Keep in mind that since your code is already highly vectorized, parfor will not have as dramatic an effect as it would on unoptimized code; but it will still help, maybe 3x-5x rather than 20x.
In C/C++ HPC environments, cache misses are a major concern; your vectorization is very likely doing a good job there. (In the past, the Windows task monitor could show 100% usage even while a CPU was in a wait state, stalled on a cache or page miss.) On Linux, there are tools (e.g., pahole, cachegrind) that can identify cache problems. As you know, you pay a lot for higher-level cache.
Which brings up C/C++. As you know, MATLAB can automatically convert your MATLAB code to C++. (There are two optional packages for that, and the claim is that you can get a 3-5x improvement; I believe trial versions are available.) But not all functions can be converted automatically, so learning MEX may be required. And to get proper C++ vectorization, you need the libraries and the APIs for MKL and IPP. (I got a 15x improvement by changing only a few lines of code in an innermost 10-line function, switching to IPP primitives that take advantage of SIMD.) Profile first to find out where your time is being spent, and then decide whether you can use C/C++ MEX. (The latest versions of the C++ valarray library do use SIMD, and valarray makes it easier to convert MATLAB into C++.)
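As a hedged sketch of the code-generation route (requires MATLAB Coder; myGraphMetric is a hypothetical, codegen-compatible function standing in for your own inner kernel):
cfg = coder.config('mex');                              % generate a MEX file rather than standalone C
codegen myGraphMetric -args {zeros(200)} -config cfg    % example input: a 200x200 double matrix
% Then call myGraphMetric_mex(...) in the hot loop and compare timings
% against the original myGraphMetric(...).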
A word on GPU: "Gathering back to the CPU can be costly, and is generally not necessary unless you need to use your result with functions that do not support gpuArray." Based on your notes, it appears that you will have to do careful analysis after profiling your program to see whether GPU will prove beneficial.
"However, people have been reporting that current versions of MATLAB are not able to use the full power of MKL (Math Kernel Library) equivalents, possibly due to the way that Intel wrote some tests of CPU capabilities into the code."
Walter gives a charitable explanation of why MKL did not boost AMD CPUs as much as expected, but the story is darker. Intel was sued because they hid the fact that they purposely put code in MKL to detect AMD products and ensure that it would not run as well as on an Intel CPU. I forget the outcome of the lawsuit, but I think it was that Intel had to disclose up front that purchasing MKL would not deliver on AMD CPUs the benefits reported for Intel CPUs. There are many articles on this subject; here is one that I quickly found, and there may even be a work-around.
If you determine that GPU is not going to help you, then consider this question:
When is an i9 processor worth the money?
Answer:
“They are generally not really worth the money at all.”
- AMD Threadripper 16-core 32 threads $880
- Intel i9-7960X 16-core 32 threads $1725
4 comments
Paul Hoffrichter
on 9 Feb 2021
Make sure that this AMD core problem is legitimate and, if so, that it has been fixed before going with AMD.
More Answers (3)
Walter Roberson
on 5 Feb 2021
If all your cores are at 95-100%, then your code is already being vectorized well automatically. In such a case, individual core speed might not be the most important factor; the aggregate speed might be what matters.
The AMD Ryzen CPUs have lower per-core speeds than some of the other available systems, but they have excellent aggregate scores because they can have a lot of cores. They would not be the first choice if your tasks mostly sat in single cores, but they can be very nice for tasks that use a lot of linear algebra or straightforward vectorization.
The AMD enterprise CPUs, E9xx, are designed for enterprise class systems -- longer lasting, better power control, higher standards on dies, more overclocking potential (requiring better cooling.) However, people have been reporting that current versions of MATLAB are not able to use the full power of MKL (Math Kernel Library) equivalents, possibly due to the way that Intel wrote some tests of CPU capabilities into the code. There is a hypothesis that performance could be improved dramatically by setting a particular environment flag, but I have not heard back from anyone who has tried setting the flag.
3 comments
Jan
on 5 Feb 2021
Edited: Jan
on 5 Feb 2021
You can increase computing power by a factor of 2 if you invest a lot of money. Improving the code can gain speedups of a factor of 100 or 1000. That is just the theory, but I have not yet seen code that could not be accelerated in some way. Sometimes it is just a question of pre-allocation or of processing the data column-wise instead of row-wise, as in the sketch below.
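A small sketch of those two points (illustrative sizes only; timings will differ on your machine):
n = 1e4;

% Pre-allocation: growing an array inside a loop vs. allocating it up front
tic; a = []; for k = 1:n, a(k) = k^2; end; tGrow = toc;          %#ok<AGROW>
tic; b = zeros(1, n); for k = 1:n, b(k) = k^2; end; tPre = toc;

% Column-wise access: MATLAB stores arrays column-major, so sweeping columns is cheaper
A = rand(2000);
tic; s1 = 0; for r = 1:size(A,1), s1 = s1 + sum(A(r,:)); end; tRow = toc;
tic; s2 = 0; for c = 1:size(A,2), s2 = s2 + sum(A(:,c)); end; tCol = toc;

fprintf('grow %.4fs  prealloc %.4fs  row-wise %.4fs  column-wise %.4fs\n', ...
    tGrow, tPre, tRow, tCol);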
So before you spend a lot of money, ask a professional programmer to help improve your code. This can help even if you are an experienced programmer yourself: a higher level of programming skill lets you see the underlying idea when you read code, and that can cause a blindness to typos and mistakes.
But, of course, sometimes the code is already almost optimal and there are no mathematical shortcuts left. Your analysis of the limiting factors of your hardware looks professional. My professor told me: "If you need a faster computer, just wait a year. If you need faster code, do it now."
2 comments
Jan
on 6 Feb 2021
MATLAB is not efficient for recursive algorithms. All recursive algorithms can be converted into iterative loops, which can be a massive speedup; see these links and the sketch after them:
- https://stackoverflow.com/questions/159590/way-to-go-from-recursion-to-iteration
- https://www.refactoring.com/catalog/replaceRecursionWithIteration.html
- https://www.cs.odu.edu/~zeil/cs361/latest/Public/recursionConversion/index.html
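As an illustration of the conversion (fibRecursive and fibIterative are made-up example names; the same pattern of carrying state forward in a loop instead of calling yourself applies to graph routines as well):
function f = fibRecursive(n)
% Straightforward recursion: exponential call tree, heavy function-call overhead
if n <= 2
    f = 1;
else
    f = fibRecursive(n-1) + fibRecursive(n-2);
end
end

function f = fibIterative(n)
% Same result, but the last two values are carried forward in a loop
a = 1; b = 1;
for k = 3:n
    t = a + b;
    a = b;
    b = t;
end
f = b;
end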
It is worth identifying the bottleneck using the profiler and posting the code here in the forum. You do not lose the option of buying stronger hardware.
Paul Hoffrichter
on 5 Feb 2021
Where is your time being spent - in the nested loop, or in the last loop that is outside the nested loop?
If, by "last loop", you mean the innermost loop, then that is not as good as using the parallel toolbox at a higher level, preferably at the highest level. If you are using parfor, there is overhead each time you enter that parfor loop, and hitting it hard in the innermost loop can subtract a good deal from your performance.
If you think you can break your program into multiple chunks that can be distributed to PCs without a large amount of messaging over the LAN, then that is certainly a reasonable option. The LAN is slow, so that is a concern; it is something you can determine in a simulation before purchasing. The fact that you say you cannot use GPUs makes me think that the distributed-PC option may not work out for you.
@Jason Ross To access a local (or cloud) compute cluster, having the Parallel Computing Toolbox is not enough; you also need MATLAB Parallel Server.
4 comments
Walter Roberson
on 6 Feb 2021
GPU use requires an NVIDIA GPU.
GPU use is more efficient under Linux than under Windows: the architecture imposed by Windows requires reserving a chunk of memory and extra transfers.
The efficiency of different models of NVIDIA GPU can vary quite a bit depending whether you are doing single precision or double precision, and exactly which model you are using. Double precision performance can be 1/32 of single precision (common!), 1/24 of single precision (not rare), 1/8 of single precision (specialized), or 1/2 of single precision (if I recall correctly; very specialized and expensive.) You really have to look very carefully at specifications if you are using double precision: an amazingly fast new generation GPU can turn out to be slower for double precision than "just the right model" of two generations before. Sometimes you have to dig a lot to figure out the double precision performance of a particular model.
For various reasons, you would prefer your GPU to have at least roughly 3 times as much memory as your largest input matrix -- with the flip side of that being that you should not count on being able to use input matrices more than roughly 1/3 of available memory.
Synchronization to start or stop a computation is one of the slowest parts, so ask "bigger questions" to amortize the costs over time. But at the same time, memory transfer is part of that cost, so ideally you would like to transfer in as little as possible and transfer out as little as possible, while keeping the computation meaningfully large.
Indexing individual elements is an expensive operation on the GPU, so for-loops that work by indexing individual locations are not an effective use of the resource (and will probably turn out slower than the CPU). Vectorize! Vectorize! Vectorize! Ideally, operate on entire arrays.
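A hedged sketch of the contrast, assuming an NVIDIA GPU and the Parallel Computing Toolbox are available:
A = rand(4000, 'gpuArray');
B = rand(4000, 'gpuArray');

% Effective: one large, fully vectorized expression stays on the device
C = A .* B + sin(A);

% Ineffective: indexing one element at a time (usually slower than the CPU)
% for k = 1:numel(A)
%     C(k) = A(k) * B(k) + sin(A(k));
% end

result = gather(C);   % gather only when the result is actually needed on the CPU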