PTX kernel time to run

2 views (last 30 days)
Gaszton on 16 May 2011
Hello, I am using R2010b with CUDA Toolkit 3.1 on a GeForce GT 425M. While I was optimizing my CUDA code, I noticed that calling the kernel with feval in MATLAB has a constant overhead of about 2 ms, measured with
tic; feval(k, ...); toc
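A minimal sketch of one way to make the timing cover kernel completion as well as the launch (illustrative only, not the measurement quoted above; it reuses the kernel object and array set up in the code further below): gather forces the result back to the host, and that copy cannot complete until the kernel has finished.
% Illustrative only: time a single launch including completion by
% forcing the result back to the host with gather.
k = parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize = [1024, 1, 1];
k.GridSize = [ceil(1024*768/1024), 1];
dholo = parallel.gpu.GPUArray.zeros(1024*768, 1, 'single');
tic
dholo = feval(k, dholo);   % queue the kernel launch
result = gather(dholo);    % copy back to host; returns only after the kernel has run
elapsedMs = 1000 * toc;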
The kernel code:
#define C_WIDTH 1024
#define C_HEIGHT 768

__global__ void timetest1(float* holo) {
    // global linear thread index
    int mindex = blockIdx.x * blockDim.x + threadIdx.x;
    int size = C_WIDTH * C_HEIGHT;
    // bounds check (with the launch configuration below the grid covers the array exactly)
    if (mindex >= size)
        return;
    holo[mindex] = mindex * mindex;
}
Even if I comment out the write to global memory (//holo[mindex]=mindex*mindex;), there is still a ~2 ms overhead.
Does anybody know the origin of this lag? It would be great to somehow eliminate it.
Thanks,
Gaszton
PS: my MATLAB code for the kernel:
clear
import parallel.gpu.GPUArray
xsize = 1024; ysize = 768;
vectorsize = xsize*ysize;
threadpblock = 1024;
k = parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize = [threadpblock, 1, 1];
k.GridSize = [ceil(vectorsize/threadpblock), 1];
dholo = parallel.gpu.GPUArray.zeros(vectorsize, 1, 'single');
tic
[dholo] = feval(k, dholo);
time = toc;
['ms time= ' num2str(time*1000)]
clear

Accepted Answer

Edric Ellis on 16 May 2011
Firstly, can I suggest that, if possible, you upgrade to R2011a, as we have made quite a few performance improvements in that release. Secondly, I think the main bottleneck in your code as written is that, outside a function, an important optimisation called "in-place optimisation" cannot take place. If you place your code inside a function, then "dholo" will not be copied. For reference, I made a function like this:
function tmp
import parallel.gpu.GPUArray
xsize = 1024; ysize = 768;
vectorsize = xsize*ysize;
threadpblock = 512; % I have a C1060
k = parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize = [threadpblock, 1, 1];
k.GridSize = [ceil(vectorsize/threadpblock), 1];
dholo = parallel.gpu.GPUArray.zeros(vectorsize, 1, 'single');
tic
for ii = 1:1000
    dholo = feval(k, dholo);
end
time = toc;
% total seconds for 1000 calls is numerically the average milliseconds per call
disp(['ms time= ' num2str(time)])
And the overhead on my C1060 was down to 0.05 ms.
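For contrast, a minimal sketch (illustrative only, not timed by anyone in this thread) of the same 1000-launch loop run outside a function: here dholo is a base-workspace variable, so the in-place optimisation described above cannot be applied and each feval call pays for an extra copy of the array.
% Sketch only: the same 1000-launch loop at the command line / in a script.
% dholo is a base-workspace variable, so it cannot be updated in place.
import parallel.gpu.GPUArray
k = parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
k.ThreadBlockSize = [512, 1, 1];
k.GridSize = [ceil(1024*768/512), 1];
dholo = parallel.gpu.GPUArray.zeros(1024*768, 1, 'single');
tic
for ii = 1:1000
    dholo = feval(k, dholo);   % output overwrites a workspace variable: no in-place update
end
scriptTime = toc;   % per the explanation above, expect this to be slower than the function version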
1 comment
Gaszton on 16 May 2011
Thank you for your help!
I am a PhD student in Hungary, at the Biological Research Centre of the Hungarian Academy of Sciences.
We have a network licence (with a limited number of MATLAB instances that can run in parallel).
We used to buy a MATLAB update every one or two years, but I don't really have much influence on that.
Thank you again,
Gaszton


More Answers (0)
