Multithreaded MEX File Generation

This example shows how to use the dspunfold function to generate a multithreaded MEX file from a MATLAB® function using unfolding technology. The MATLAB function can contain an algorithm which is stateless (has no states) or stateful (has states).

NOTE: The following example assumes that the current host computer has at least two physical CPU cores. The presented screenshots, speedup, and latency values were collected using a host computer with eight physical CPU cores.

Required MathWorks® products:

  • DSP System Toolbox™

  • MATLAB Coder™

Use dspunfold with a MATLAB Function Containing a Stateless Algorithm

Consider the MATLAB function dspunfoldDCTExample. This function computes the DCT of an input signal and returns the value and index of the maximum energy point.

function [peakValue,peakIndex] = dspunfoldDCTExample(x)
% Stateless MATLAB function that computes the DCT of a signal (e.g. audio) and
% returns the value and index of the highest energy point

% Copyright 2015 The MathWorks, Inc.

X = dct(x);   
[peakValue,peakIndex] = max(abs(X));

To accelerate the algorithm, a common approach is to generate a MEX file using the codegen function. This example shows how to do so for an input of 4096 doubles. The generated MEX file, dspunfoldDCTExample_mex, is single-threaded.

codegen dspunfoldDCTExample -args {(1:4096)'}

To generate a multithreaded MEX file, use the dspunfold function. The argument -s 0 indicates that the algorithm in dspunfoldDCTExample is stateless.

dspunfold dspunfoldDCTExample -args {(1:4096)'} -s 0

This command generates these files:

  • Multithreaded MEX file dspunfoldDCTExample_mt

  • Single-threaded MEX file dspunfoldDCTExample_st, which is identical to the MEX file obtained using the codegen function

  • Self-diagnostic analyzer function dspunfoldDCTExample_analyzer

Three additional MATLAB files are also generated, containing the help for each of these files.

To measure the speedup of the multithreaded MEX file relative to the single-threaded MEX file, see the example function dspunfoldBenchmarkDCTExample.

function dspunfoldBenchmarkDCTExample
% Function used to measure the speedup of the multi-threaded MEX file
% dspunfoldDCTExample_mt obtained using dspunfold vs the single-threaded MEX
% file dspunfoldDCTExample_st.

% Copyright 2015 The MathWorks, Inc.

clear dspunfoldDCTExample_mt;  % for benchmark precision purpose
numFrames = 1e5;
inputFrame = (1:4096)';

% exclude first run from timing measurements
dspunfoldDCTExample_st(inputFrame);
tic;  % measure execution time for the single-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_st(inputFrame);
end
timeSingleThreaded = toc;

% exclude first run from timing measurements
dspunfoldDCTExample_mt(inputFrame);
tic;  % measure execution time for the multi-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_mt(inputFrame);
end
timeMultiThreaded = toc;
fprintf('Speedup = %.1fx\n',timeSingleThreaded/timeMultiThreaded);

dspunfoldBenchmarkDCTExample measures the time that dspunfoldDCTExample_st and dspunfoldDCTExample_mt each take to process numFrames frames. It then prints the speedup, which is the ratio of the single-threaded MEX file execution time to the multithreaded MEX file execution time. Run the example.

Speedup = 4.7x

To improve the speedup even more, increase the repetition value. To modify the repetition value, use the -r flag. For more information on the repetition value, see the dspunfold function reference page. For an example on how to specify the repetition value, see the section 'Using dspunfold with a MATLAB Function Containing a Stateful Algorithm'.

dspunfold generates a multithreaded MEX file, which buffers multiple signal frames and then processes these frames simultaneously, using multiple cores. This process introduces some deterministic output latency. Executing help dspunfoldDCTExample_mt displays more information about the multithreaded MEX file, including the value of the output latency. For this example, the output of the multithreaded MEX file has a latency of 16 frames relative to its input, which is not the case for the single-threaded MEX file.

Run the dspunfoldShowLatencyDCTExample example. The generated plot displays the outputs of the single-threaded and multithreaded MEX files. Notice that the output of the multithreaded MEX is delayed by 16 frames relative to that of the single-threaded MEX.
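When the two MEX files are compared numerically, the multithreaded output has to be shifted by the latency first. The following is a minimal sketch of that alignment. It uses synthetic data rather than the generated MEX files: stOut stands in for the per-frame outputs of the single-threaded MEX, and mtOut is a simulated delayed copy. The variable names and the zero fill for the first 16 frames are assumptions made for illustration only.

```matlab
% Sketch: align two per-frame output sequences that differ by a known latency.
% stOut and mtOut are synthetic stand-ins, not outputs of the generated MEX files.
latency   = 16;                 % output latency in frames, as reported for this example
numFrames = 100;                % hypothetical number of processed frames

stOut = (1:numFrames)';         % stand-in for single-threaded per-frame outputs
% Simulated multithreaded outputs: the same values, delayed by 'latency' frames.
% Filling the first frames with zeros is an assumption for this sketch.
mtOut = [zeros(latency,1); stOut(1:end-latency)];

% Drop the first 'latency' frames of mtOut and the last 'latency' frames of stOut.
mtAligned = mtOut(latency+1:end);
stAligned = stOut(1:end-latency);

outputsMatch = isequal(mtAligned, stAligned);
fprintf('Compared %d frames, match = %d\n', numel(stAligned), outputsMatch);
```

With real data, stOut and mtOut would be collected by calling the two MEX files on the same sequence of input frames.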


Using dspunfold with a MATLAB Function Containing a Stateful Algorithm

The MATLAB function dspunfoldFIRExample executes two FIR filters.

type dspunfoldFIRExample
function y = dspunfoldFIRExample(u,c1,c2)
% Stateful MATLAB function executing two FIR filters 

% Copyright 2015 The MathWorks, Inc.

persistent FIRSTFIR SECONDFIR
if isempty(FIRSTFIR)
    FIRSTFIR = dsp.FIRFilter('NumeratorSource','Input port');
    SECONDFIR = dsp.FIRFilter('NumeratorSource','Input port');
end
t = FIRSTFIR(u,c1);
y = SECONDFIR(t,c2);

To build the multithreaded MEX file, you must provide the state length corresponding to the two FIR filters. Specify -s 1 to indicate that the state length does not exceed 1 frame.

firCoeffs1 = fir1(127,0.8);
firCoeffs2 = fir1(256,0.2,'High');
dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} -s 1

Executing this code generates:

  • Multithreaded MEX file dspunfoldFIRExample_mt

  • Single-threaded MEX file dspunfoldFIRExample_st

  • Self-diagnostic analyzer function dspunfoldFIRExample_analyzer

  • The corresponding MATLAB help files for these three files

The output latency of the multithreaded MEX file is 16 frames. To measure the speedup, execute dspunfoldBenchmarkFIRExample.

Speedup = 3.9x

To improve the speedup of the multithreaded MEX file even more, specify the exact state length in samples. To do so, you must specify which input arguments to dspunfoldFIRExample are frames. In this example, the first input is a frame because its elements are sequenced in time, so it can be further divided into subframes. The last two inputs are not frames because the FIR filter coefficients cannot be subdivided without changing the nature of the algorithm. The state length of the dspunfoldFIRExample MATLAB function is the sum of the state lengths of the two FIR filters (127 + 256 = 383). Using the -f argument, mark the first input argument as true (frame) and the last two input arguments as false (nonframes).

dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} -s 383 -f [true,false,false]
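The value 383 passed to -s can be checked from the lengths of the coefficient vectors: an FIR filter with N+1 coefficients holds N past input samples as state. The sketch below uses zero vectors with the same lengths as the fir1 outputs (128 and 257 coefficients) so that it runs without any toolbox. firStates is a hypothetical helper defined here, not a toolbox function.

```matlab
% An FIR filter with N+1 coefficients keeps N past input samples as state.
firStates = @(coeffs) numel(coeffs) - 1;   % hypothetical helper for this sketch

% Placeholder vectors with the same lengths as fir1(127,...) and fir1(256,...).
c1 = zeros(1,128);
c2 = zeros(1,257);

% State length of the two filters in series, in samples.
stateLength = firStates(c1) + firStates(c2);
fprintf('State length = %d samples\n', stateLength);
```

In practice, the actual firCoeffs1 and firCoeffs2 vectors could be passed to the same helper in place of the placeholders.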

Again, measure the speedup for the resulting multithreaded MEX using the dspunfoldBenchmarkFIRExample function. Notice that the speedup increased because the exact state length was specified in samples, and dspunfold was able to subdivide the frame inputs.

Speedup = 6.3x

Often, the speedup can be increased further by raising the repetition value (-r) passed to dspunfold. The default repetition value is 1. When you increase this value, the multithreaded MEX buffers more frames internally before processing starts. A larger repetition value makes the multithreading more efficient, but at the cost of a higher output latency.

dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} ...
-s 383 -f [true,false,false] -r 5

Again, measure the speedup for the resulting multithreaded MEX, using the dspunfoldBenchmarkFIRExample function. Speedup increases again, but the output latency is now 80 frames. The general output latency formula is 2*Threads*Repetition frames. In these examples, the number of Threads is equal to the number of physical CPU cores.

Speedup = 7.7x
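As a quick check of the latency figures quoted in this example, the formula 2*Threads*Repetition can be evaluated directly. This is only a worked calculation, and outputLatency is a hypothetical helper name, not part of dspunfold.

```matlab
% Output latency in frames, using the formula stated above: 2*Threads*Repetition.
outputLatency = @(threads, repetition) 2 * threads * repetition;

latencyDefault     = outputLatency(8, 1);   % eight threads, default repetition of 1
latencyRep5        = outputLatency(8, 5);   % eight threads, -r 5
latencyFourThreads = outputLatency(4, 5);   % four threads, -r 5

fprintf('8 threads, r = 1: %d frames\n', latencyDefault);
fprintf('8 threads, r = 5: %d frames\n', latencyRep5);
fprintf('4 threads, r = 5: %d frames\n', latencyFourThreads);
```

The first two values correspond to the 16-frame and 80-frame latencies reported for the eight-core host. The third corresponds to the combination of 4 threads and a repetition of 5 that appears in the dspunfold log in the next section.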

Detecting State Length Automatically

To request that dspunfold autodetect the state length, specify -s auto. This option generates an efficient multithreaded MEX file, but with a significant increase in the generation time, due to the extra analysis that it requires.

dspunfold dspunfoldFIRExample -args {(1:2048)',firCoeffs1,firCoeffs2} ...
-s auto -f [true,false,false] -r 5
State length: [autodetect] samples, Repetition: 5, Output latency: 40 frames, Threads: 4
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexw64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 2048 samples ... Sufficient
Checking 1024 samples ... Sufficient
Checking 512 samples ... Sufficient
Checking 256 samples ... Insufficient
Checking 384 samples ... Sufficient
Checking 320 samples ... Insufficient
Checking 352 samples ... Insufficient
Checking 368 samples ... Insufficient
Checking 376 samples ... Insufficient
Checking 380 samples ... Insufficient
Checking 382 samples ... Insufficient
Checking 383 samples ... Sufficient
Minimal state length is 383 samples
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexw64
Creating analyzer file: dspunfoldFIRExample_analyzer.p

dspunfold checks different state lengths, using the values provided with the -args option as inputs. The function searches for the minimum state length for which the outputs of the multithreaded MEX and single-threaded MEX are the same. Notice that it found a minimal state length of 383 samples, which matches the value computed manually earlier.
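The sequence of candidates in the log above is consistent with a bisection search between the stateless case and one full frame. The following sketch reproduces that sequence under stated assumptions. It is not the implementation used by dspunfold. In particular, isSufficient is a hypothetical stand-in for the real check, which compares the outputs of the two MEX files, and the value 383 is hard-coded here only so that the sketch can run on its own.

```matlab
% Sketch of a bisection search for the smallest sufficient state length.
% isSufficient is a hypothetical stand-in: the real check compares the outputs
% of the multithreaded and single-threaded MEX files for a candidate length.
assumedStateLength = 383;                  % assumption for this sketch
isSufficient = @(s) s >= assumedStateLength;

frameLength = 2048;
lo = 0;                 % stateless: insufficient in this example
hi = frameLength;       % one full frame: sufficient in this example
checked = [lo hi];      % record the candidates in the order they are tried

while hi - lo > 1
    mid = floor((lo + hi) / 2);
    checked(end+1) = mid;
    if isSufficient(mid)
        hi = mid;       % mid works, so the minimum is at or below mid
    else
        lo = mid;       % mid fails, so the minimum is above mid
    end
end

minimalStateLength = hi;
fprintf('Minimal state length is %d samples\n', minimalStateLength);
```

Bisection needs only about log2(2048) = 11 checks after the two end points, but each check in the real tool requires building and running a MEX file, which explains the longer generation time.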

Verify Generated Multithreaded MEX Using the Generated Analyzer

When creating a multithreaded MEX file using dspunfold, the single-threaded MEX file is also created along with an analyzer function. For the stateful example in the previous section, the name of the analyzer is dspunfoldFIRExample_analyzer.

The goal of the analyzer is to provide a quick way to measure the speedup of the multithreaded MEX relative to the single-threaded MEX, and also to check if the outputs of the multithreaded MEX and single-threaded MEX match. Outputs usually do not match when an incorrect state length value is specified.

Execute the analyzer for the multithreaded MEX file, dspunfoldFIRExample_mt, generated previously using the -s auto option.

firCoeffs1_1 = fir1(127,0.8);
firCoeffs1_2 = fir1(127,0.7);
firCoeffs1_3 = fir1(127,0.6);
firCoeffs2_1 = fir1(256,0.2,'High');
firCoeffs2_2 = fir1(256,0.1,'High');
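firCoeffs2_3 = fir1(256,0.3,'High');
dspunfoldFIRExample_analyzer((1:2048*3)',[firCoeffs1_1;firCoeffs1_2;firCoeffs1_3],[firCoeffs2_1;firCoeffs2_2;firCoeffs2_3]);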
Analyzing multi-threaded MEX file dspunfoldFIRExample_mt.mexw64  ... 
Latency = 80 frames
Speedup = 7.8x

Each input to the analyzer corresponds to an input of the dspunfoldFIRExample_mt MEX file. Notice that the length (first dimension) of each input is greater than the expected length. For example, dspunfoldFIRExample_mt expects a frame of 2048 doubles for its first input, while 2048*3 samples were provided to dspunfoldFIRExample_analyzer. The analyzer interprets this input as 3 frames of 2048 samples. The analyzer alternates between these 3 input frames circularly while checking if the outputs of the multithreaded and single-threaded MEX files match.
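The circular alternation between frames can be expressed with modular indexing. The sketch below illustrates the idea only and is not the analyzer's actual implementation. analyzerInput, frameIndex, and the chosen step are hypothetical names and values introduced for this illustration.

```matlab
% Sketch of circular frame selection; not the analyzer's actual implementation.
frameLength = 2048;
numFrames   = 3;
analyzerInput = (1:frameLength*numFrames)';   % hypothetical 3-frame input

numSteps = 240;                               % 3*latency for a latency of 80 frames
% Frame used at each step: 1,2,3,1,2,3,...
frameIndex = mod((1:numSteps) - 1, numFrames) + 1;

% Extract the frame used at a given step, for example step 4.
step = 4;
rows = (frameIndex(step) - 1)*frameLength + (1:frameLength);
frame = analyzerInput(rows);
fprintf('Step %d uses frame %d (rows %d to %d)\n', ...
    step, frameIndex(step), rows(1), rows(end));
```

The same row selection applies to every analyzer input whose first dimension stacks several frames.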

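The table shows the inputs used by the analyzer at each step of the numerical check. The total number of steps invoked by the analyzer is 240 or 3*latency, where latency is 80 in this case.

          Input 1                       Input 2         Input 3
Step 1    Frame 1 (samples 1:2048)      firCoeffs1_1    firCoeffs2_1
Step 2    Frame 2 (samples 2049:4096)   firCoeffs1_2    firCoeffs2_2
Step 3    Frame 3 (samples 4097:6144)   firCoeffs1_3    firCoeffs2_3
Step 4    Frame 1 (samples 1:2048)      firCoeffs1_1    firCoeffs2_1
...       ...                           ...             ...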

NOTE: For the analyzer to correctly check for the numerical match between the multithreaded MEX and single-threaded MEX, provide at least two frames with different values for each input. For inputs that represent parameters, such as filter coefficients, the frames can have the same values for each input. In this example, you could have specified a single set of coefficients for the second and third inputs.
