How to speed up a MEX function?

The following MEX code runs too slowly, and I don't know why or how to make it faster. Any help is greatly appreciated!
calculate_my_way.cpp
#include "mex.hpp"
#include "mexAdapter.hpp"
#include <cmath>
class MexFunction : public matlab::mex::Function {
public:
    void operator()(matlab::mex::ArgumentList outputs, matlab::mex::ArgumentList inputs) {
        matlab::data::TypedArray<double> var0 = inputs[0];
        matlab::data::TypedArray<double> var1 = inputs[1];
        matlab::data::TypedArray<double> var2 = inputs[2];
        matlab::data::TypedArray<double> var3 = inputs[3];
        auto var0Iter = var0.begin();
        auto var1Iter = var1.begin();
        auto var2Iter = var2.begin();
        auto var3Iter = var3.begin();
        const int numOfElements = var0.getNumberOfElements();
        double buffer = 0;
        for (int x = 0; x < numOfElements; x++)
        {
            buffer = std::sin(*var0Iter) + std::sin(*var1Iter) + std::sin(*var2Iter) + std::cos(*var3Iter);
            *var0Iter = buffer;
            buffer = std::sin(*var1Iter + *var2Iter) + std::cos(*var3Iter);
            *var1Iter = buffer;
            var0Iter++;
            var1Iter++;
            var2Iter++;
            var3Iter++;
        }
        outputs[0] = std::move(var0);
        outputs[1] = std::move(var1);
    }
};
It's just a simple calculation, but this code runs even slower than the native distance function, which performs a much more complicated calculation than a few sin and cos calls.
I'm using the compiler that came with Visual Studio 2017. Below is how I run mex, along with the compiler setup info.
mex -v calculate_my_way.cpp
...
Compiler location: C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\
...
OPTIMFLAGS : /O2 /Oy- /DNDEBUG
And this is how I measure the performance issue:
clear
size_test = 1e7;
var1 = zeros(size_test, 1);
var2 = zeros(size_test, 1);
var3 = zeros(size_test, 1);
var4 = zeros(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
elapsed_time = timeit(cant_beat_me);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
elapsed_time = timeit(mex_slow);

15 comments

Rik on 2 Nov 2022
Apart from the segfault if var1 is longer than the others, did you try with a random test set as well? The distance function may have some calls optimized away.
I might be able to try this code on my desktop later today.
buffer = std::sin(*var0Iter) + std::sin(*var1Iter) + std::sin(*var2Iter) + std::cos(*var3Iter);
*var0Iter = buffer;
buffer = std::sin(*var1Iter + *var2Iter) + std::cos(*var3Iter);
You calculate std::cos(*var3Iter) twice.
Yifan Lin on 2 Nov 2022
@Rik, I'm not concerned about safety just yet. I can't even catch up without safety checks, so how could I catch up with additional argument checks?
@Walter Roberson, sure, I did. But I bet the distance function has more than just a single sin calculation that is repeated.
I'm guessing this is a compiler choice? Does MATLAB use an Intel compiler that I don't have?
Bruno Luong on 2 Nov 2022
Edited: Bruno Luong on 2 Nov 2022
"I'm guessing this is a compiler choice? Does MATLAB use an Intel compiler that I don't have?"
I have the Intel compiler, so I can test.
But MATLAB may implement this with vectorized arithmetic and multithreading; you could do the same with OpenMP.
There are a few people here who do miracles with MEX programming, James Tursa and Jan Simon to name a few, but I believe they are more C oriented than C++.
Walter Roberson on 2 Nov 2022
Which distance function are you comparing to?
Yifan Lin on 2 Nov 2022
@Bruno Luong Thanks! If you could test it with the Intel compiler, that would be great! I wish I could use C, but I probably can't, since I want to be able to call some other functions that are written in C++.
@Walter Roberson, the Mapping Toolbox distance function. I picked it just out of convenience, since it also has 4 inputs and 2 outputs. Ref: Distance between points on sphere or ellipsoid - MATLAB distance (mathworks.com)
Walter Roberson on 2 Nov 2022
The Mapping Toolbox distance() function is not coded in mex. You can read the MATLAB source code for it. The code converts the angles to radians, and then uses its local function greatcircledist() to compute using the haversine formula, and then does something that I do not recognize at the moment involving atan2() -- at least for the default calculation. There is a different code path if you use some of the options.
timeit results for your code with the VS compiler and the Intel oneAPI compiler (2022):
VS_elapsed_time % 0.1795
Intel_elapsed_time % 0.1781
Bruno Luong on 2 Nov 2022
Edited: Bruno Luong on 2 Nov 2022
Obviously, the run time of evaluating cos/sin depends on the data.
Comparison between MATLAB and C++ with zero data:
clear
size_test = 1e7;
var1 = zeros(size_test, 1);
var2 = zeros(size_test, 1);
var3 = zeros(size_test, 1);
var4 = zeros(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
MATLAB_elapsed_time = timeit(cant_beat_me) % 0.0274
Intel_elapsed_time = timeit(mex_slow) % 0.1803
function [out0,out1] = distance(var0, var1, var2, var3)
out0 = sin(var0) + sin(var1) + sin(var2) + cos(var3);
out1 = sin(var1 + var2) + cos(var3);
end
With random data:
clear
size_test = 1e7;
var1 = 2*pi*rand(size_test, 1);
var2 = 2*pi*rand(size_test, 1);
var3 = 2*pi*rand(size_test, 1);
var4 = 2*pi*rand(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
MATLAB_elapsed_time = timeit(cant_beat_me) % 0.1560
Intel_elapsed_time = timeit(mex_slow) % 0.5101
The factor of
>> 0.5101/0.156
ans =
3.2699
could well be explained by multithreading.
Yifan Lin on 2 Nov 2022
@Bruno Luong Thanks! And darn. I guess it's not the compiler, then? So now I think what's left to try is:
  1. Do this in C, to eliminate the possible overhead of the C++ MEX API.
  2. Use vectorized arithmetic with multithreading via OpenMP, as you suggested.
Bruno Luong on 2 Nov 2022
Or stay with MATLAB?
Yifan Lin on 2 Nov 2022
@Bruno Luong It would be nice to stay with MATLAB, but my code is just a stand-in for the eventual implementation, which won't be just simple sin/cos. Right now I'm trying to understand whether MEX can actually achieve the speed I need.
Yifan Lin on 2 Nov 2022
@Bruno Luong Thanks again! I will definitely give OpenMP a try!
Out of curiosity, I coded the same calculation in C. The time is 0.24 s: twice as fast as C++ (0.5 s) but about 60% slower than MATLAB (0.147 s).
/* mex -g -R2018a calculate_C_way.c */
#include "mex.h"
#include <math.h>
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, n;
    double *var0Iter, *var1Iter, *var2Iter, *var3Iter, *out0Iter, *out1Iter;
    n = mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    plhs[1] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    var0Iter = mxGetDoubles(prhs[0]);
    var1Iter = mxGetDoubles(prhs[1]);
    var2Iter = mxGetDoubles(prhs[2]);
    var3Iter = mxGetDoubles(prhs[3]);
    out0Iter = mxGetDoubles(plhs[0]);
    out1Iter = mxGetDoubles(plhs[1]);
    for (i = 0; i < n; i++) {
        *out0Iter = sin(*var0Iter) + sin(*var1Iter) + sin(*var2Iter) + cos(*var3Iter);
        *out1Iter = sin(*var1Iter + *var2Iter) + cos(*var3Iter);
        out0Iter++;
        out1Iter++;
        var0Iter++;
        var1Iter++;
        var2Iter++;
        var3Iter++;
    }
}
Yifan Lin on 3 Nov 2022
@Bruno Luong, Thanks! I was also curious and wanted to give this a try, but you beat me to it! Yes, apparently the C++ API is slower than the C API for MATLAB. Ref: this post - Is C++ MEX API significantly slower than the C MEX API? - MATLAB Answers - MATLAB Central (mathworks.com). I've also tried OpenMP as you suggested, but the problem was that I was using VS2017, so I couldn't use #pragma omp simd. I'll wait for my VS2019 install to finish and try again there with the C API.


Accepted Answer

Bruno Luong on 3 Nov 2022
Edited: Bruno Luong on 3 Nov 2022
Last experiment: time with C and OpenMP, Intel Parallel Studio XE 2022
CIntel_elapsed_time = 0.0574 [sec]
About 2.5 times faster than MATLAB (finally I beat MATLAB).
To get a fast MEX function: use the C API (not the C++ API), make it multithreaded, and select a decent compiler.
/* Compile with intel compiler
mex -O COMPFLAGS="$COMPFLAGS /MD /Qopenmp" -R2018a calculate_C_way.c */
#include "mex.h"
#include <math.h>
/* Set to 1 to Enable OPENMP
   to 0 to disable it */
#define OPENMP_FLAG 1
#if OPENMP_FLAG == 1
#include <omp.h>
#endif
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, n;
    double *var0Iter, *var1Iter, *var2Iter, *var3Iter, *out0Iter, *out1Iter;
    n = mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    plhs[1] = mxCreateNumericMatrix(1, n, mxDOUBLE_CLASS, mxREAL);
    var0Iter = mxGetDoubles(prhs[0]);
    var1Iter = mxGetDoubles(prhs[1]);
    var2Iter = mxGetDoubles(prhs[2]);
    var3Iter = mxGetDoubles(prhs[3]);
    out0Iter = mxGetDoubles(plhs[0]);
    out1Iter = mxGetDoubles(plhs[1]);
#if OPENMP_FLAG==1
#pragma omp parallel for default(none) private(i) \
    schedule(static) \
    shared(n, out0Iter, out1Iter, var0Iter, var1Iter, var2Iter, var3Iter)
#endif
    for (i = 0; i < n; i++) {
        out0Iter[i] = sin(var0Iter[i]) + sin(var1Iter[i]) + sin(var2Iter[i]) + cos(var3Iter[i]);
        out1Iter[i] = sin(var1Iter[i] + var2Iter[i]) + cos(var3Iter[i]);
    }
}

2 comments

Yifan Lin on 3 Nov 2022
@Bruno Luong Thank you very much!!!! This is exactly what I was looking for!
Typically, instead of this
#define OPENMP_FLAG 1
#if OPENMP_FLAG == 1
#include <omp.h>
#endif
you can use this:
#ifdef _OPENMP
#include <omp.h>
#endif
The _OPENMP macro is defined by the compiler whenever OpenMP support is enabled, so no manual flag is needed.


More Answers (1)

Bruno Luong on 2 Nov 2022
Edited: Bruno Luong on 2 Nov 2022
I don't know C++ well, but I have quite a lot of practice with C MEX.
It looks like these statements just move a bunch of data:
outputs[0] = std::move(var0);
outputs[1] = std::move(var1);
Also, I wonder whether your inputs 0 and 1 are changed by
*var0Iter = buffer;
...
*var1Iter = buffer;
after calling the MEX function, which is NOT allowed.

2 comments

Yifan Lin on 2 Nov 2022
@Bruno Luong! Another one of your answers here helped me tremendously a few years back! Thank you!
I've tested the var0 and var1 values, and they did change. They then get moved to the output.
So after [a,b] = calculate_my_way(0,0,0,0); a and b will both be 1.
I suspect that this slowness may be due to one of the following:
1. MSVC is not as good as the compiler MathWorks uses (probably Intel Parallel Studio).
2. Calling a C++ MEX function may carry some massive overhead that I don't know about.
3. I am just not doing something right in my C++ code.
Bruno Luong on 2 Nov 2022
"Another one of your answers here helped me tremendously a few years back! Thank you!"
Oh... really glad to read that...


Release: R2019b
Asked: 1 Nov 2022
Commented: 7 Nov 2022
