GCC compiled MEX file taking more time than the one compiled by Microsoft Visual Studio.

Question

Ubaid Ullah on 4 Jul 2015


Direct link to this question

https://la.mathworks.com/matlabcentral/answers/228412-gcc-compiled-mex-file-taking-more-time-than-the-one-compiled-by-microsoft-visual-studio

Commented: Ubaid Ullah on 8 Jul 2015

opts_files.zip

Hello, I have the following loop:

spmd
    dgtilde = zeros(length(denom),d.nexp2);
    for mm = 1:d.nexp2
        dgtilde(:,mm) = sum(g{d.exp2(mm,1)}.*g{d.exp2(mm,2)}.*weight,2) ...
            - gtilde(:,d.exp2(mm,1)).*gtilde(:,d.exp2(mm,2));
    end 
end

I converted the inner loop to C code as follows:

#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Arguments, in the order they are read below:
 * prhs[0] = g (cell array of double matrices), prhs[1] = mom (scalar),
 * prhs[2] = weight, prhs[3] = exp2 (index pairs), prhs[4] = gtilde. */
void mexFunction(int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])
{
    const mwSize *dims;
    const mxArray *cellArray1, *cellArray2;
    double *pr1, *pr2;
    double *weight, *gtilde;
    double *exp2;
    double *sum_gammaXmom;
    int mom, mm1, mm2;
    mwIndex i, j, jcell, sgIndex;
    mwSize ncol, nrow;

    if(nrhs != 5) mexErrMsgTxt("Five input arguments are required.");
    mom = (int)mxGetScalar(prhs[1]);
    weight = mxGetPr(prhs[2]);
    exp2 = mxGetPr(prhs[3]);
    dims = mxGetDimensions(prhs[3]);
    gtilde = mxGetPr(prhs[4]);
    if(mom>dims[0]) mexErrMsgTxt("d.mom variable exceeds g cell array size.");
    /* Every cell is assumed to have the same size as the first one. */
    cellArray1 = mxGetCell(prhs[0], 0);
    nrow = mxGetM(cellArray1);
    ncol = mxGetN(cellArray1);
    plhs[0] = mxCreateDoubleMatrix(nrow, mom, mxREAL);
    sum_gammaXmom = mxGetPr(plhs[0]);
    /* Zero the output before accumulating into it. */
    for(j=0;j<(mom*nrow);j++) sum_gammaXmom[j] = 0;
    for (jcell=0; jcell<mom; jcell++) {
        /* Convert the 1-based MATLAB indices in row jcell of exp2 to 0-based.
           exp2 is column-major, so its second column starts dims[0] elements in. */
        mm1 = (int)exp2[jcell]-1;
        mm2 = (int)exp2[jcell+dims[0]]-1;
        cellArray1 = mxGetCell(prhs[0], mm1);
        cellArray2 = mxGetCell(prhs[0], mm2);
        pr1 = mxGetPr(cellArray1);
        pr2 = mxGetPr(cellArray2);
        for(i=0;i<nrow;i++) {
            sgIndex = i+jcell*nrow;
            /* Weighted sum across the columns of row i. */
            for(j=0;j<ncol;j++){
                sum_gammaXmom[sgIndex] += pr1[i+j*nrow]*pr2[i+j*nrow]*weight[i+j*nrow];
            }
            sum_gammaXmom[sgIndex] = sum_gammaXmom[sgIndex]-gtilde[i+mm1*nrow]*gtilde[i+mm2*nrow];
        }
    }
}

When I compiled the MEX file with the Microsoft Visual Studio compiler on a Windows machine, it cut the execution time in half. On the other hand, when I compiled the file to MEX using the GCC compiler, the execution time didn't improve at all. I have three questions:

Why is there this difference between the performance of two compilers?
Is there a way to improve C code to perform better?
Should I expect an improvement in speed if I use a 3D matrix 'g' as an input, instead of a cell array of double matrices 'g'?

The g variable is a Composite; each lab's data contains a cell array of double matrices.
The weight variable is a Composite; each lab's data contains a double matrix.
The sum_gammaXmom variable in the C code holds the computed dgtilde.

Addendum:

Actually, I have a client who is working on a Linux/Unix-based system with GCC. When I first delivered the C files to him, he compiled them and told me that the result is only 2x faster than native MATLAB, whereas I was getting a 3x improvement with Microsoft Visual Studio. So I installed GCC on my computer, tested my C functions, and got the same 3x improvement that I was getting with the MVS compilers. I asked him to compile with the -O1, -O2, and -O3 options, but no luck there. I am attaching the mex_C_glnxa64.xml file he is using on his computer and the GCC MEXOPTS.bat file that I am using on my local machine. Can you guys tell me whether we are using different parameters that could be causing this difference in performance between the two machines?

Thanks.

3 comments

Ubaid Ullah on 4 Jul 2015

Thanks for your comment, dpb. I have tried the GCC compiler with the -O1 to -O3 switches; no difference so far.

dpb on 4 Jul 2015

Surprising; gcc is generally considered quite good. Do you have a recent release? What are you running it under: is it a native installation, or is it under an emulation layer or something, by any chance?


Answer 1

Ivo Houtzager on 4 Jul 2015


Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/228412-gcc-compiled-mex-file-taking-more-time-than-the-one-compiled-by-microsoft-visual-studio#answer_185025

There is a difference in the default floating-point optimization between the two compilers.

Floating-point calculations compiled by GCC follow strict IEEE compliance by default. The optional -ffast-math flag enables optimizations that can break strict IEEE compliance. You can try whether this option improves the speed, at the possible cost of accuracy.

Floating-point calculations compiled by the VS compiler do not preserve strict IEEE compliance by default. The default option /fp:precise enables some non-strict optimizations. If you need strict floating-point calculations from the VS compiler, use the /fp:strict option. For the fastest floating-point calculations that the VS compiler can offer, use the /fp:fast option.

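The VS compiler also enables the use of SSE2 instructions (option /arch:SSE2) by default on x86 platforms. GCC does not enable SSE2 by default when targeting 32-bit x86; on x86-64 targets SSE2 is part of the baseline instruction set and is always available. To tune the generated code for the most common processors, use the option -mtune=generic. Note that -mtune only affects instruction selection and scheduling; additional instruction sets are enabled with -march or with flags such as -msse2.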
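To illustrate why strict IEEE semantics hold the optimizer back, here is a minimal standalone C sketch (it is not part of the MEX file, and the function names sum_sequential and sum_four_lanes are made up for this illustration). The first function adds in source order, which the compiler has to preserve unless reassociation is allowed. The second keeps four independent partial sums, which is roughly the kind of reordering a vectorizer may apply under -ffast-math. The two results can differ in the last bits, and that is the accuracy cost mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Strict source order: every addition depends on the previous one,
   so the additions form one long dependency chain. */
double sum_sequential(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent partial sums, combined at the end. This changes the
   order of the additions, so the rounding may differ slightly. */
double sum_four_lanes(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Calling both functions on the same array and printing the results with %.17g shows whether the two summation orders round differently for your data.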

4 comments

Ivo Houtzager on 8 Jul 2015


The following line shows the optimization options from mexopts.bat.

set OPTIMFLAGS=-O3 -funroll-loops -DNDEBUG

The following line shows the optimization options from mex_C_glnxa64.xml.

COPTIMFLAGS="-O -DNDEBUG"

Thus the compiler on the Windows platform optimizes more aggressively than the one on the Linux platform (-O3 versus -O, which GCC treats as -O1). In addition, loop unrolling is enabled for the Windows compiler. You can copy the compile options from mexopts.bat into mex_C_glnxa64.xml to improve the optimization. You can even try to improve it further by adding the -ffast-math and/or -mtune=generic options to that line, as discussed above.

Ubaid Ullah on 8 Jul 2015

Well, my client tried -O1 to -O3, but he didn't see any improvement. I will ask him to use the -ffast-math and -mtune=generic options.


Answer 2

Jan on 4 Jul 2015


Direct link to this answer

https://la.mathworks.com/matlabcentral/answers/228412-gcc-compiled-mex-file-taking-more-time-than-the-one-compiled-by-microsoft-visual-studio#answer_185022


Why is there this difference between the performance of two compilers?

Compilers translate the C code to machine instructions. There are different possible translations, which lead to the same results but with different runtime. E.g. a compiler can create MMX, SSE, SSE2 or SSE3 instructions. Some will run on modern processors only, others support older processors also. Therefore it is expected that different compilers create programs with different speed.

Try memset instead of a loop to set sum_gammaXmom to zero. Or even better: omit this zeroing, because mxCreateDoubleMatrix already fills the array with zeros.

sum_gammaXmom[sgIndex] += pr1[i+j*nrow]*pr2[i+j*nrow]*weight[i+j*nrow];

You could try whether storing i+j*nrow in a variable avoids the repeated calculation of the same value, but I hope that smart compilers recognize this. A general problem remains the memory access: it is much cheaper to read and write neighboring elements in memory. Is it possible to run the loop over i on the inside, such that [i+j*nrow] accesses contiguous memory elements?
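As a rough sketch of that loop interchange, the two plain C functions below compute the same weighted row sums over column-major arrays. They only stand in for the inner kernel of the MEX file: the names weighted_rowsum_strided and weighted_rowsum_contiguous are hypothetical, the arrays a, b and w play the roles of pr1, pr2 and weight, and the gtilde subtraction is left out. The first function keeps the original order, which jumps nrow elements on every step; the second runs i in the inner loop, so each pass walks down one column with stride 1.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: for each row i, step across the columns (stride nrow). */
void weighted_rowsum_strided(const double *a, const double *b, const double *w,
                             size_t nrow, size_t ncol, double *out)
{
    assert(a && b && w && out);
    for (size_t i = 0; i < nrow; i++) {
        out[i] = 0.0;
        for (size_t j = 0; j < ncol; j++)
            out[i] += a[i + j * nrow] * b[i + j * nrow] * w[i + j * nrow];
    }
}

/* Interchanged order: for each column j, walk down the rows (stride 1). */
void weighted_rowsum_contiguous(const double *a, const double *b, const double *w,
                                size_t nrow, size_t ncol, double *out)
{
    assert(a && b && w && out);
    for (size_t i = 0; i < nrow; i++)
        out[i] = 0.0;
    for (size_t j = 0; j < ncol; j++) {
        /* Pointers to the start of column j in each array. */
        const double *aj = a + j * nrow;
        const double *bj = b + j * nrow;
        const double *wj = w + j * nrow;
        for (size_t i = 0; i < nrow; i++)
            out[i] += aj[i] * bj[i] * wj[i];
    }
}
```

Each out[i] still receives its terms in the same column order, so the interchange by itself should not change the rounding. Whether it actually runs faster depends on the matrix sizes and the cache, so it is worth timing both variants on real data.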

5 comments

Jan on 5 Jul 2015

Accessing 25 cells costs less than a millisecond. But I do not understand what "with each cell having an 25-element array of double matrices" means.

Ubaid Ullah on 7 Jul 2015

@Jan. Sorry about that. I corrected the sentence.
