Preconditioning for iterative solvers on GPU - Performance issues
Dear all,
I'm experimenting with preconditioners for iterative solvers on the GPU for a linear system [A]{x}={B}. The problem is set up with this simple call:
sol=pcg(A_gpu,B_gpu,tol,maxit,P)
where A and B are gpuArrays and P is the preconditioner.
Some simple tests show that, with P = [ ], the GPU solution is faster than any iterative CPU solver, with speedups of up to 12x.
However, what I still can't figure out is why performance drops whenever any type of preconditioner is selected. For instance, using an incomplete Cholesky factorization:
L=ichol(A)
sol=pcg(A_gpu,B_gpu,tol,maxit,L*L')
blows up the run time compared to no preconditioner at all on the GPU. The solution is even slower than the CPU version, where this same preconditioner improves CPU performance by 1.5x. That's really strange.
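For reference, on the CPU pcg accepts the two triangular factors as separate arguments (M1, M2), which lets it apply the preconditioner with two triangular solves instead of forming and solving with the explicit product L*L' on every iteration. A minimal sketch, reusing A, B, tol and maxit from above:

```matlab
% CPU sketch: pass the ichol factor and its transpose separately,
% so pcg performs two cheap triangular solves per iteration
% rather than an mldivide against the assembled product L*L'.
L   = ichol(A);                       % no-fill incomplete Cholesky (lower triangular)
sol = pcg(A, B, tol, maxit, L, L');   % M1 = L, M2 = L'  =>  M = L*L'
```

Whether the gpuArray interface accepts the split-factor form in this release is a separate question; the sketch above is the standard CPU usage.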
I've also tried passing A_gpu as the preconditioner, but the solution takes forever:
sol=pcg(A_gpu,B_gpu,tol,maxit,A_gpu)
This issue also affects other iterative solvers, such as BICG and SYMMLQ.
Am I doing something wrong? It seems that any preconditioner on the GPU hurts performance, even when the same preconditioner helps the CPU version.
Please share your thoughts and experiences. Thanks!
7 comments
Walter Roberson
14 Nov 2019
Remember that GPU processing does not correspond exactly to CPU processing. Users make top-level calls, and MATLAB can use any GPU implementation it deems suitable, not necessarily the same one used on the CPU. In particular, MATLAB can make use of third-party pre-tuned GPU libraries that might not have been designed with preconditioners in mind.
Joss Knight
15 Nov 2019
Does it take more iterations or is each iteration slower?
On the GPU the preconditioning method is no-fill ILU, which can be slow. What you lose in start-up overhead you are supposed to gain in convergence speed, i.e. it reduces the number of iterations. But it is problem-dependent. It would help if you could provide an example A and B for me to try.
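One way to answer the "more iterations or slower iterations" question is to read pcg's extra outputs and time both runs; a sketch, assuming A_gpu, B_gpu, tol, maxit and P are defined as in the original post:

```matlab
% Separate iteration count from per-iteration cost: pcg's extra
% outputs give the convergence flag, relative residual and iteration count.
tic; [x0, fl0, rr0, it0] = pcg(A_gpu, B_gpu, tol, maxit);    t0 = toc;
tic; [x1, fl1, rr1, it1] = pcg(A_gpu, B_gpu, tol, maxit, P); t1 = toc;
fprintf('no precond: %d iters, %.3f s  (%.3g s/iter)\n', it0, t0, t0/it0);
fprintf('precond:    %d iters, %.3f s  (%.3g s/iter)\n', it1, t1, t1/it1);
```

If the per-iteration time balloons with P while the iteration count barely drops, the cost is in applying the preconditioner rather than in convergence.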
Paulo Ribeiro
15 Nov 2019
Edited: Paulo Ribeiro
16 Nov 2019
Joss Knight
21 Nov 2019
The explanation of why certain preconditioners work or do not work is beyond my expertise. I do know that preconditioning on the GPU uses a different algorithm, so I would expect different behaviour from the CPU.
Your NVIDIA RTX 2080 SUPER does not have good double-precision performance. At 316 GFLOPS it is 32x slower than its single-precision performance and likely slower than your CPU for the kinds of hybrid computations that happen in the iterative solvers, perhaps in particular for computing the ILU. I would recommend using your CPU when you are using a preconditioner.
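You can get a rough feel for the FP64 penalty on your own card by timing the same matrix multiply in both precisions; a sketch (the 4096 size is arbitrary, chosen just to keep the GPU busy):

```matlab
% Rough throughput comparison: time a dense GEMM in double and single
% precision on the current GPU. On most GeForce-class cards the double
% run is far slower, roughly reflecting the 1:32 FP64:FP32 ratio.
n  = 4096;
Ad = gpuArray.rand(n);              % double-precision operand
As = gpuArray.rand(n, 'single');    % single-precision operand
td = gputimeit(@() Ad * Ad);        % gputimeit synchronizes the device
ts = gputimeit(@() As * As);
fprintf('double/single time ratio: %.1fx\n', td / ts);
```

The ratio is only indicative for sparse solver workloads, which are memory-bound as well as compute-bound, but a large gap confirms the card is weak in double precision.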
Joss Knight
21 Nov 2019
For what it's worth, these are the results I got for your data on a Titan V, which has around 7 TFLOPS in double precision. I saw similar issues when passing the reconstructed Cholesky or ILU factors; I can't explain that, but perhaps the sparsity pattern is just a really poor match for the GPU factorization algorithm. We do intend to provide a future enhancement that will allow two triangular preconditioners to be passed to the solver, so that the decomposition can be applied independently.
>> Ag = gpuArray(A);
>> Bg = gpuArray(B);
>> P = diag(diag(A));
>> tic; pcg(A,B,1e-5,6000); toc
pcg converged at iteration 5346 to a solution with relative residual 1e-05.
Elapsed time is 25.906744 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000); toc
pcg converged at iteration 5345 to a solution with relative residual 1e-05.
Elapsed time is 1.399854 seconds.
>> tic; pcg(A,B,1e-5,6000,P); toc
pcg converged at iteration 5501 to a solution with relative residual 1e-05.
Elapsed time is 34.181677 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000,P); toc
pcg converged at iteration 5502 to a solution with relative residual 9.8e-06.
Elapsed time is 2.404074 seconds.
In other words, preconditioner or no, the GPU is giving a great performance improvement.
Paulo Ribeiro
21 Nov 2019
Edited: Paulo Ribeiro
22 Nov 2019
Joss Knight
25 Nov 2019
I investigated further and found that applying the preconditioner, not just decomposing it, does appear to take an unusually long time. This warrants further investigation, since these two triangular solves should be fast and your system matrix is band-diagonal. It does have quite a large bandwidth of 543, however, so that could be the issue.
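Anyone reproducing this can check the band structure of their own system matrix directly; a sketch, assuming A is the sparse system matrix:

```matlab
% Check the matrix's band structure: wide bands make the triangular
% solves used to apply the preconditioner more serial and GPU-unfriendly.
[lo, up] = bandwidth(A);            % lower and upper bandwidth of sparse A
fprintf('bandwidth: lower %d, upper %d\n', lo, up);
spy(A);                             % visualize the sparsity pattern
```

A reverse Cuthill-McKee reordering (symrcm) can sometimes reduce the bandwidth before factorization, though whether it helps here is problem-dependent.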
Iterative solvers are generally faster than direct solvers for large sparse matrices (assuming they have reasonable convergence properties). Direct solves are hugely memory-intensive because there is a lot of fill-in during factorization.
Answers (0)