Failed to generate large CUDA kernel in GPU coder with FFT function inside
I am trying to parallelize my code on the GPU.
I converted the code with the "main.m" script, as attached. However, the MEX code on the GPU is much slower than the M-code on the CPU. I understand that the GPU is not suited to such a small data size, but the GPU also takes much longer when a bigger data size is used.
I then checked the profiling timeline and found that many CUDA kernels are created and the overall GPU utilization is low. After some debugging, I found that when the fft command is used, GPU Coder fails to generate a large CUDA kernel.
I think the performance could improve significantly if the fft could be incorporated into one CUDA kernel, as in the case without fft. The FFT is needed. I have tried searching on Google, but found nothing relevant. Can you provide any information about this, or any solution? The output of gpuDevice is also provided in the attachment.
Here is the profiling timeline without fft.
Here is the profiling timeline with fft.
0 comments
Answers (1)
Justin Hontz
18 September 2024
Hi He,
In your M-code for RandCopy, the for loop cannot be executed as a GPU kernel (even with the coder.gpu.kernel pragma) because of the fft / ifft calls inside the loop. This is because fft is implemented using its own specialized GPU kernel, and GPU Coder does not support nested kernel execution. Consequently, the for loop runs sequentially, which explains why you see thousands of small kernel instances in the performance analyzer timeline graph.
To improve the performance of your code, perform the computation with a single fft / ifft call that operates on the entire input array instead of on individual slices. Something like this should work:
Tmp = fft(Data,[],2);    % one batched FFT along dim 2 (every row at once)
Tmp = Tmp + (1 + 1i);
Tmp = Tmp * (1564 + 798i);
Data = ifft(Tmp,[],2);   % one batched inverse FFT along dim 2
After making the change on my end, the performance analyzer report shows a significant performance improvement, with the timeline graph looking similar to the original one without fft.
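Because fft(...,[],2) transforms each row independently, the batched form is numerically equivalent to looping over slices. A quick MATLAB sanity check (the variable names and array size here are illustrative, not taken from the original RandCopy):

```matlab
Data0 = complex(rand(4,8), rand(4,8));
Ref = Data0;
for k = 1:size(Ref,1)   % per-slice version: one fft/ifft per row
    Ref(k,:) = ifft((fft(Ref(k,:)) + (1 + 1i)) * (1564 + 798i));
end
% Batched version: one fft/ifft over the whole array
Vec = ifft((fft(Data0,[],2) + (1 + 1i)) * (1564 + 798i), [], 2);
max(abs(Ref(:) - Vec(:)))   % difference should be on the order of eps
```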
4 comments
Justin Hontz
19 September 2024
GPU Coder currently does not support generating direct calls to the cuFFTDx API. That said, you may still be able to call into the API indirectly from the generated code if you are willing to write your own CUDA wrapper function that uses the API directly. This can be done by invoking the wrapper function inside the for loop of your M-code via coder.ceval. The call would look something like this:
coder.ceval('-gpudevicefcn', 'myFFTWrapper', coder.ref(data), ...);
The -gpudevicefcn flag indicates that the wrapper function is meant to be executed by a GPU thread rather than by the CPU.
Note that I have not tried using this approach on my end, so I cannot guarantee that such an approach would work correctly without issue.
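For reference, a heavily hedged sketch of what such a device-side wrapper might look like. The function name myFFTWrapper, the fixed 64-point size, and the operator composition are assumptions following the pattern in the cuFFTDx documentation; this is untested and would need to be adapted to your actual slice length and GPU architecture:

```cuda
// Hypothetical device-side cuFFTDx wrapper (illustrative, not tested).
#include <cufftdx.hpp>

// Describe a 64-point single-precision complex-to-complex forward FFT
// executed entirely by one GPU thread (thread-level execution).
using FFT = decltype(cufftdx::Size<64>()
                   + cufftdx::Precision<float>()
                   + cufftdx::Type<cufftdx::fft_type::c2c>()
                   + cufftdx::Direction<cufftdx::fft_direction::forward>()
                   + cufftdx::Thread());

// Called from the generated GPU kernel via coder.ceval('-gpudevicefcn', ...).
__device__ void myFFTWrapper(FFT::value_type* data)
{
    // Stage one slice into registers, transform in place, copy back.
    FFT::value_type regs[FFT::storage_size];
    for (unsigned i = 0; i < FFT::storage_size; ++i) regs[i] = data[i];
    FFT().execute(regs);
    for (unsigned i = 0; i < FFT::storage_size; ++i) data[i] = regs[i];
}
```

Consult the cuFFTDx documentation for the operators and storage layout that apply to your FFT size and GPU.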