cuBLAS Example
This example multiplies two matrices A and B by using the cuBLAS library. The MATLAB® implementation of GEneral Matrix-Matrix Multiplication (GEMM) is:
function [C] = blas_gemm(A,B)
C = zeros(size(A));
C = A * B;
end
Generated CUDA Code
When you generate CUDA® code, GPU Coder™ creates function calls to initialize the cuBLAS library, perform matrix-matrix operations, and release hardware resources that the cuBLAS library uses. The following is a snippet of the generated CUDA code.
cublasEnsureInitialization();
blas_gemm_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_C);
alpha1 = 1.0;
beta1 = 0.0;
cudaMemcpy((void *)gpu_alpha1, (void *)&alpha1, 8ULL, cudaMemcpyHostToDevice);
cudaMemcpy((void *)gpu_A, (void *)A, 8388608ULL, cudaMemcpyHostToDevice);
cudaMemcpy((void *)gpu_B, (void *)B, 8388608ULL, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_beta1, &beta1, 8ULL, cudaMemcpyHostToDevice);
cublasDgemm(cublasGlobalHandle, CUBLAS_OP_N, CUBLAS_OP_N, 1024, 1024, 1024,
            (double *)gpu_alpha1, (double *)&gpu_A[0], 1024,
            (double *)&gpu_B[0], 1024, (double *)gpu_beta1,
            (double *)&gpu_C[0], 1024);
cublasEnsureDestruction();
cudaMemcpy((void *)C, (void *)gpu_C, 8388608ULL, cudaMemcpyDeviceToHost);
To initialize the cuBLAS library and create a handle to the cuBLAS library context, the function cublasEnsureInitialization() calls the cublasCreate() cuBLAS API. It allocates hardware resources on the host and device. The function also sets the pointer mode to CUBLAS_POINTER_MODE_DEVICE, which is why the generated code copies the scalars alpha1 and beta1 to the device before calling cublasDgemm.
static void cublasEnsureInitialization(void)
{
  if (cublasGlobalHandle == NULL) {
    cublasCreate(&cublasGlobalHandle);
    cublasSetPointerMode(cublasGlobalHandle, CUBLAS_POINTER_MODE_DEVICE);
  }
}
blas_gemm_kernel1 initializes the result matrix C to zero. This kernel is launched with 2048 blocks and 512 threads per block. These block and thread values correspond to the size of C: 2048 × 512 = 1,048,576 threads, one per element of the 1024-by-1024 result matrix.
static __global__ __launch_bounds__(512, 1) void blas_gemm_kernel1(real_T *C)
{
  int32_T threadIdX;
  threadIdX = (int32_T)(blockDim.x * blockIdx.x + threadIdx.x);
  if (!(threadIdX >= 1048576)) {
    C[threadIdX] = 0.0;
  }
}
Calls to cudaMemcpy transfer the matrices A and B from the host to the device. The function cublasDgemm is a level-3 Basic Linear Algebra Subprogram (BLAS3) that performs the matrix-matrix multiplication:

C = αAB + βC

where α and β are scalars, and A, B, and C are matrices stored in column-major format. The CUBLAS_OP_N arguments specify that neither input matrix is transposed before the multiplication.
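For comparison, here is a minimal standalone sketch of the same operation written directly against the cuBLAS API. The function name gemm_sketch, the square size n, and the use of the default CUBLAS_POINTER_MODE_HOST (which lets alpha and beta stay on the host, unlike the generated code above) are illustrative choices and not part of the generated code; error checking is omitted for brevity.

// Minimal sketch: C = A*B for square n-by-n column-major matrices,
// using alpha = 1 and beta = 0 as in the generated code.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_sketch(const double *A, const double *B, double *C, int n)
{
  double *dA, *dB, *dC;
  size_t bytes = (size_t)n * n * sizeof(double);
  cudaMalloc((void **)&dA, bytes);
  cudaMalloc((void **)&dB, bytes);
  cudaMalloc((void **)&dC, bytes);
  cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);

  const double alpha = 1.0;  /* C = 1*A*B + 0*C, that is, C = A*B */
  const double beta = 0.0;
  /* m = n = k and lda = ldb = ldc = n for square column-major matrices */
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, dA, n, dB, n, &beta, dC, n);

  cublasDestroy(handle);
  cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
}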
The final calls are to cublasEnsureDestruction() and another cudaMemcpy. cublasEnsureDestruction() calls the cublasDestroy() cuBLAS API to release the hardware resources that the cuBLAS library uses. cudaMemcpy copies the result matrix C from the device to the host.
static void cublasEnsureDestruction(void)
{
  if (cublasGlobalHandle != NULL) {
    cublasDestroy(cublasGlobalHandle);
    cublasGlobalHandle = NULL;
  }
}
Prepare blas_gemm for Kernel Creation
GPU Coder requires no special pragma to generate calls to libraries. There are two ways to generate CUDA kernels: coder.gpu.kernelfun and coder.gpu.kernel. This example uses the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified blas_gemm function is:
function [C] = blas_gemm(A,B) %#codegen
C = coder.nullcopy(zeros(size(A)));
coder.gpu.kernelfun;
C = A * B;
end
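To generate CUDA code for the modified function, you can pass it to the codegen command with a GPU code configuration object. The invocation below is an illustrative sketch rather than part of the example source; the 1024-by-1024 input sizes are chosen to match the generated code shown earlier.

% Illustrative invocation: create a GPU code configuration object and
% generate a MEX target for 1024-by-1024 double inputs.
cfg = coder.gpuConfig('mex');
codegen -config cfg blas_gemm -args {zeros(1024),zeros(1024)}

You can then call the generated MEX function, for example blas_gemm_mex(A,B), and compare its output with A*B computed in MATLAB to verify the result.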
Note
The input data must contain at least 128 elements for GPU Coder to replace math operators and functions with cuBLAS library implementations.