Preconditioning for iterative solvers on GPU - Performance issues

Dear all,
I'm experimenting with some preconditioners for iterative solvers on the GPU, for a linear system [A]{x}={B}. The problem is set up with this simple call:
sol=pcg(A_gpu,B_gpu,tol,maxit,P)
where A and B are gpuArrays and P is the preconditioner.
Some simple tests show that, with P=[ ], the GPU solution is faster than any iterative CPU solver, with speedups of up to 12x.
However, what I still can't figure out is why performance drops whenever any type of preconditioner is selected. For instance, using an incomplete Cholesky factorization:
L=ichol(A)
sol=pcg(A_gpu,B_gpu,tol,maxit,L*L')
destroys the performance compared to using no preconditioner at all on the GPU. The solution is even slower than the CPU version, where this same preconditioner improves CPU performance by 1.5x. That's really strange.
I've also tried passing A_gpu as preconditioner, but the solution takes forever:
sol=pcg(A_gpu,B_gpu,tol,maxit,A_gpu)
This issue also affects other iterative solvers, such as BICG and SYMMLQ.
Am I doing something wrong? It appears that any preconditioner on the GPU hurts performance, even ones that are effective in the CPU version.
Please share your thoughts and experiences. Thanks!

7 comments

Remember that GPU processing does not correspond exactly to CPU processing. Users make top-level calls, and MATLAB can use any GPU implementation it deems suitable, not necessarily the same one that would be used on the CPU. In particular, MATLAB can make use of third-party pre-tuned GPU libraries that might not have been designed with preconditioners in mind.
Does it take more iterations or is each iteration slower?
On the GPU the preconditioning method is via no-fill ILU which can be slow. What you lose in start-up overhead you are supposed to gain in convergence speed, i.e. it reduces the number of iterations. But it is problem-dependent. It would help if you could provide an example A and B for me to try.
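For reference, a minimal CPU sketch of that pattern (assuming A and B are the sparse system from the question, with the question's tol and maxit; ichol's options struct and pcg's two-factor form are documented MATLAB APIs):

```
% No-fill incomplete Cholesky on the CPU (ichol requires a sparse
% CPU matrix). Passing L and L' as separate arguments lets pcg apply
% the preconditioner as two triangular solves, instead of a solve
% with the explicitly formed product L*L'.
L = ichol(A, struct('type','nofill'));
sol = pcg(A, B, 1e-5, 1e5, L, L');
```

Whether the two triangular solves converge well is problem-dependent, which is why an example A and B would help.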
Paulo Ribeiro on 15 Nov 2019 (edited 16 Nov 2019)
Hi Joss and Walter.
I'm providing a OneDrive link for [A] and {B}:
The .mat file exceeds 5 MB and cannot be attached to this message. For this specific benchmark there is not a single case where a preconditioner on the GPU provides better performance than a run with no preconditioner at all.
Benchmarks are based on the following test:
P=diag(diag(A));
A=gpuArray(A);
B=gpuArray(B);
sol=pcg(A,B,1e-5,1e5,P);
Some comments:
a) with P=[ ] (the GPU, no-preconditioner case), PCG converges in 3.26 s after 5346 iterations.
b) with P=diag(diag(A)), it converges in 8.73 s after 5501 iterations.
c) using BICG as the iterative solver with P=[ ], it converges in 4.90 s after 5395 iterations; with P=diag(diag(A)), it converges in 16.58 s after 5567 iterations.
d) using Incomplete Cholesky factorization with:
L=ichol(A);
P=L*L';
destroys the performance of both solvers (PCG and BICG); processing time exceeded 300 s, so I cancelled the run. On the other hand, this same preconditioner provides a significant speedup (1.6x) with the CPU solver.
e) an ILU preconditioner is also a problem.
f) in my experiments there is not a single case where a preconditioner provides better results than the P=[ ] scenario.
g) my current setup is an NVIDIA RTX 2070 SUPER on Windows 10 with MATLAB R2019a.
h) [A] is a sparse, symmetric, diagonally dominant matrix.
Many thanks for your support. Hope that your experience can help me on this issue. Regards.
PS: I wonder if preconditioning prior to the iterative solver call will provide better performance.
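One variant worth trying for case b) above (a sketch, not benchmarked on R2019a's gpuArray pcg; pcg's documented function-handle form is assumed to be supported for gpuArray inputs): pass the Jacobi preconditioner as a function handle, so each application is an element-wise divide instead of a sparse solve against the matrix diag(diag(A)).

```
% Jacobi (diagonal) preconditioning via a function handle. pcg calls
% the handle with the residual r and expects M\r back; for a diagonal
% M that is just an element-wise divide.
d = gpuArray(full(diag(A)));   % diagonal of A as a dense GPU vector
jacobi = @(r) r ./ d;
sol = pcg(A_gpu, B_gpu, 1e-5, 1e5, jacobi);
```

This sidesteps the sparse triangular-solve machinery entirely, which is the part that appears to be slow on the GPU.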
The explanation as to why certain preconditioners work or do not work is beyond my expertise. I know that preconditioning on the GPU uses a different algorithm and so would expect different behaviour than the CPU.
Your NVIDIA RTX 2070 SUPER does not have good double-precision performance.
At 316 GFLOPS it is 32x slower than its single-precision performance and likely slower than your CPU for the kinds of hybrid computations happening in the iterative solvers - perhaps in particular for computing the ILU. I would recommend using your CPU when you are using a preconditioner.
For what it's worth, these are the results I got for your data on a Titan V, which has around 7 TFLOPS in double precision. I saw similar issues when passing the reconstructed Cholesky or ILU factors - I can't explain that, but perhaps the sparsity pattern is just a really poor match for the GPU factorization algorithm. We do intend to provide a future enhancement that will allow two triangular preconditioners to be passed to the solver, so that the decomposition can be applied independently of the solve.
>> Ag = gpuArray(A);
>> Bg = gpuArray(B);
>> P = diag(diag(A));
>> tic; pcg(A,B,1e-5,6000); toc
pcg converged at iteration 5346 to a solution with relative residual 1e-05.
Elapsed time is 25.906744 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000); toc
pcg converged at iteration 5345 to a solution with relative residual 1e-05.
Elapsed time is 1.399854 seconds.
>> tic; pcg(A,B,1e-5,6000,P); toc
pcg converged at iteration 5501 to a solution with relative residual 1e-05.
Elapsed time is 34.181677 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000,P); toc
pcg converged at iteration 5502 to a solution with relative residual 9.8e-06.
Elapsed time is 2.404074 seconds.
In other words, preconditioner or no, the GPU is giving a great performance improvement.
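A side note on timings like these: gpuArray operations are launched asynchronously, so a plain tic/toc can mis-measure. A sketch of a more defensive pattern, using the documented wait(gpuDevice) synchronization (pcg's per-iteration convergence checks likely synchronize anyway, but the explicit wait makes the measurement unambiguous):

```
% Synchronize the GPU before stopping the clock, so tic/toc measures
% all queued GPU work rather than just the time to launch it.
dev = gpuDevice;
tic;
sol = pcg(Ag, Bg, 1e-5, 6000, P);
wait(dev);   % block until all queued GPU work has completed
toc
```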
Paulo Ribeiro on 21 Nov 2019 (edited 22 Nov 2019)
Thanks Joss. These are really impressive results on a Titan V. It's even faster than a backslash solver A\B on the CPU with an Intel i7 8700:
tic; A\B; toc
Elapsed time is 1.712258 seconds.
For this specific case it appears that the best option is to avoid preconditioning on the GPU.
Regards.
I investigated further and found that applying the preconditioner - not just decomposing it - does appear to be taking an unusually long time. This warrants further investigation, since those two triangular solves should be fast, and your system matrix is banded. It does have quite a large bandwidth of 543, however, so that could be the issue.
Iterative solvers are generally faster than direct solves for large sparse matrices (assuming they have reasonable convergence properties). Direct solves are hugely memory-intensive because there is a lot of fill-in during factorization.
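That fill-in can be measured directly for a given matrix (a sketch, assuming A is symmetric positive definite as pcg requires; symamd is one of MATLAB's fill-reducing orderings):

```
% Compare nonzeros in the sparse Cholesky factor against nonzeros
% in A itself: the ratio shows the extra storage a direct solve
% needs. A fill-reducing reordering shrinks, but rarely eliminates,
% the fill.
p = symamd(A);                 % fill-reducing symmetric permutation
R = chol(A(p,p));              % sparse Cholesky factor, reordered
fillRatio = nnz(R) / nnz(A);   % >> 1 means heavy fill-in
```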


Answers (0)

Release: R2019a
Asked: 14 Nov 2019
Last commented: 25 Nov 2019
