How to make my GPU interpolation code faster?

Question

1 voto

I am doing a lot of linear interpolation for a project. MATLAB's griddedInterpolant function for the CPU is fast, and their overloaded function interpn for the GPU is even faster. However, the structure of my code is that I need to do linear interpolation in a for loop where the points to be interpolated are calculated at each loop. Every loop is independent, so I coded my own linear interpolation function to run on the GPU using arrayfun. My function is faster than griddedInterpolant but slower than interpn. I was wondering if anyone could see a way to speed up my function. Or do you think it would benefit from compiling it as a GPU MEX function (I have never done this and it would be a big time cost for me, so I would only like to do it if someone knows it would help a lot). Note that my data does not go off the grid and my grid is uniformly spaced in each dimension. My code is below (sorry for the appearence, it doesn't fit properly. You can find my function of FEX at http://www.mathworks.com/matlabcentral/fileexchange/47437-interp3gpu-m):

function Vi=interp3gpu(x1,x2,x3,V,x1i,x2i,x3i)
Nx1 = length(x1);
Nx2 = length(x2);
Nx3 = length(x3);
x11 = x1(1);
x21 = x2(1);
x31 = x3(1);
s1 = x1(2)-x11;
s2 = x2(2)-x21;
s3 = x3(2)-x31;
Vi = arrayfun(@gpuInterp,x1i,x2i,x3i);
    function Vi=gpuInterp(x1i,x2i,x3i)        
    x1i = (x1i-x11)/s1+1;
    x2i = (x2i-x21)/s2+1;
    x3i = (x3i-x31)/s3+1;
      loc1 = min(Nx1-1,floor(x1i));
      loc2 = min(Nx2-1,floor(x2i));
      loc3 = min(Nx3-1,floor(x3i));
      % x1i now becomes weight for x1i. Overwrite instead of creating new
      % variable for memory performance. Same for x2i and x3i.
      x1i = x1i-loc1;
      x2i = x2i-loc2;
      x3i = x3i-loc3;
      Vi  = (1-x1i)*((1-x2i)*((1-x3i)*V(loc1  +(loc2-1)*Nx1+(loc3-1)*Nx1*Nx2)   ...
                            +    x3i *V(loc1  +(loc2-1)*Nx1+(loc3)  *Nx1*Nx2))  ...
                   +    x2i *((1-x3i)*V(loc1  +(loc2)  *Nx1+(loc3-1)*Nx1*Nx2)   ...
                            +    x3i *V(loc1  +(loc2)  *Nx1+(loc3)  *Nx1*Nx2))) ...
           +    x1i*((1-x2i)*((1-x3i)*V(loc1+1+(loc2-1)*Nx1+(loc3-1)*Nx1*Nx2)   ...
                            +    x3i *V(loc1+1+(loc2-1)*Nx1+(loc3)  *Nx1*Nx2))  ...
                   +    x2i *((1-x3i)*V(loc1+1+(loc2)  *Nx1+(loc3-1)*Nx1*Nx2)   ...
                            +    x3i *V(loc1+1+(loc2)  *Nx1+(loc3)  *Nx1*Nx2)));
   end
end

2 comentarios
Mostrar Ninguno Ocultar Ninguno

Matt J el 3 de Ag. de 2014

My function is faster than griddedInterpolant but slower than interpn.

Are you sure you don't mean the other way around? interpn calls griddedInterpolant , so it shouldn't be faster than it.

Philip el 3 de Ag. de 2014

griddedInterpolant is a function that runs on the CPU. interpn can run either on the CPU or the GPU. You are correct that if you use interpn on the CPU, then it just calls griddedInterpolant. However, if you use interpn on the GPU, then MATLAB has their own built-in GPU version of this code that is faster than the CPU versions of interpn and griddedInterpolant. My code, when it runs on the GPU, is faster than griddedInterpolant and interpn running on the CPU, but is slower than interpn running on the GPU.

Iniciar sesión para comentar.

Iniciar sesión para responder a esta pregunta.

Follow Question

Answer 1

Matt J el 4 de Ag. de 2014

0 votos

It might be worth trying PAGEFUN on the GPU, where each page contains the data you are currently processing in a loop iteration.

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Philip el 4 de Ag. de 2014

Thanks for the suggestion. I tried this, but pagefun doesn't allow you to create custom functions like arrayfun does. I had to make multiple calls to pagefun, which slowed down the code. There maybe a better way to use pagefun, but I wasn't able to figure it out.

Iniciar sesión para comentar.

Answer 2

Matt J el 4 de Ag. de 2014

Editada: Matt J el 4 de Ag. de 2014

0 votos

You're doing a lot of repeat computations of things like 1-loc2, 1-xi1, etc... which might be avoided.

You're also expanding the interpolation formula into 22 multiplies and 8 adds. It should be possible to interpolate a 2x2x2 neighborhood in about 14 multiplies and 7 adds, by doing the interpolation in a separable manner, first along the x axis, then y, then z.

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Philip el 9 de Ag. de 2014

Hey Matt, thanks for the suggestions. I tried playing around with creating a second variable like loc3_=loc3-1 and then call it. It seems like the memory operation of storing the new variable is more costly than performing the calculation a few times of loc3-1.

I agree with your last statement. If you look closely at my parenthesis, I am doing 14 multiplications and 7 adds. It's just I written it as one operation, instead of storing the intermediate calculations of first interpolation on the x axis, then the y axis, then z axis. This way is faster since you do not have to create intermediate variables to store your results.

Iniciar sesión para comentar.

How to make my GPU interpolation code faster?

2 comentarios
Mostrar Ninguno Ocultar Ninguno

Respuestas (2)

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Categorías

Productos

Etiquetas

Community Treasure Hunt

How to make my GPU interpolation code faster?

2 comentarios Mostrar Ninguno Ocultar Ninguno

Respuestas (2)

1 comentario Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

1 comentario Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

Categorías

Productos

Etiquetas

Ver también

Community Treasure Hunt

2 comentarios
Mostrar Ninguno Ocultar Ninguno

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos

1 comentario
Mostrar -1 comentarios más antiguos Ocultar -1 comentarios más antiguos