# gpucoder.batchedMatrixMultiply

Optimized GPU implementation of batched matrix multiply operation

## Syntax

``[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2)``
``[D1,...,DN] = gpucoder.batchedMatrixMultiply(A1,B1,...,AN,BN)``
``___ = gpucoder.batchedMatrixMultiply(___,Name,Value)``

## Description

`[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2)` performs matrix-matrix multiplication of a batch of matrices `A1,B1` and `A2,B2`. The `gpucoder.batchedMatrixMultiply` function performs matrix-matrix multiplication of the form $D=\alpha AB$, where $\alpha$ is a scalar multiplication factor and `A`, `B`, and `D` are matrices with dimensions `m`-by-`k`, `k`-by-`n`, and `m`-by-`n` respectively. You can optionally transpose or Hermitian-conjugate `A` and `B`. By default, $\alpha$ is set to one and the matrices are not transposed. To specify a different scalar multiplication factor and perform transpose operations on the input matrices, use the `Name,Value` pair arguments.

All the batches passed to the `gpucoder.batchedMatrixMultiply` function must be uniform. That is, all instances must have the same dimensions `m,n,k`.

`[D1,...,DN] = gpucoder.batchedMatrixMultiply(A1,B1,...,AN,BN)` performs matrix-matrix multiplication of multiple `A`, `B` pairs of the form ${D}_{i}=\alpha {A}_{i}{B}_{i}, \quad i=1\dots N$.

`___ = gpucoder.batchedMatrixMultiply(___,Name,Value)` performs batched matrix multiply operation by using the options specified by one or more `Name,Value` pair arguments.
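The semantics described above can be checked against a plain MATLAB reference. The following is a minimal sketch, not the function's implementation: it uses the ordinary `*` operator in a loop, and the cell arrays `A`, `B`, `D` and the chosen sizes are illustrative assumptions. It also shows the uniformity requirement as an explicit size check.

```matlab
% Reference sketch in plain MATLAB (no GPU Coder required):
% computes D_i = alpha * A_i * B_i for every pair in the batch.
% The variable names and sizes here are illustrative assumptions.
alpha = 0.3;
A = {rand(15,42), rand(15,42)};   % batch of m-by-k matrices
B = {rand(42,30), rand(42,30)};   % batch of k-by-n matrices

% Take m, k, n from the first pair; every other pair must match.
[m,k] = size(A{1});
n = size(B{1},2);

D = cell(size(A));
for i = 1:numel(A)
    % Uniformity check: all instances share the same m, n, k.
    assert(isequal(size(A{i}),[m k]) && isequal(size(B{i}),[k n]), ...
        'Batch instance %d does not match the uniform size.', i);
    D{i} = alpha * (A{i} * B{i});
end

disp(size(D{1}))   % each product is m-by-n
```

Results from `gpucoder.batchedMatrixMultiply` with the same inputs should agree with `D{1}` and `D{2}` up to floating-point rounding.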

## Examples

Perform a simple batched matrix-matrix multiplication and use the `gpucoder.batchedMatrixMultiply` function to generate CUDA® code that calls appropriate `cublas<t>gemmBatched` APIs.

In one file, write an entry-point function `myBatchMatMul` that accepts matrix inputs `A1`, `B1`, `A2`, and `B2` and a scalar multiplication factor `alpha`. Because the input matrices are not transposed, use the `'nn'` option.

```
function [D1,D2] = myBatchMatMul(A1,B1,A2,B2,alpha)

[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2, ...
    'alpha',alpha,'transpose','nn');

end
```

To create a type for a matrix of doubles for use in code generation, use the `coder.newtype` function.

```
A1 = coder.newtype('double',[15,42],[0 0]);
A2 = coder.newtype('double',[15,42],[0 0]);
B1 = coder.newtype('double',[42,30],[0 0]);
B2 = coder.newtype('double',[42,30],[0 0]);
alpha = 0.3;
inputs = {A1,B1,A2,B2,alpha};
```

To generate a CUDA library, use the `codegen` function.

```
cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myBatchMatMul
```

The generated CUDA code contains kernels `myBatchMatMul_kernelNN` for initializing the input and output matrices. The code also contains the `cublasDgemmBatched` API calls to the cuBLAS library. The following code is a snippet of the generated code.

```
//
// File: myBatchMatMul.cu
//
...
void myBatchMatMul(const double A1, const double B1, const double A2,
                   const double B2, double alpha, double D1, double D2)
{
  double alpha1;
  ...
  myBatchMatMul_kernel1<<<dim3(2U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_A2,
    *gpu_A1, *gpu_input_cell_f2, *gpu_input_cell_f1);
  cudaMemcpy(gpu_B2, (void *)&B2, 10080UL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B1, (void *)&B1, 10080UL, cudaMemcpyHostToDevice);
  myBatchMatMul_kernel2<<<dim3(3U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_B2,
    *gpu_B1, *gpu_input_cell_f4, *gpu_input_cell_f3);
  myBatchMatMul_kernel3<<<dim3(1U, 1U, 1U), dim3(480U, 1U, 1U)>>>(gpu_r3,
    gpu_r2);
  myBatchMatMul_kernel4<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r2,
    *gpu_out_cell);
  myBatchMatMul_kernel5<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_r3,
    *gpu_out_cell);
  ...
  cublasDgemmBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
                     15, 30, 42, (double *)gpu_alpha1,
                     (double **)gpu_Aarray, 15, (double **)gpu_Barray, 42,
                     (double *)gpu_beta1, (double **)gpu_Carray, 15, 2);
  myBatchMatMul_kernel6<<<dim3(1U, 1U, 1U), dim3(480U, 1U, 1U)>>>(*gpu_D2,
    *gpu_out_cell, *gpu_D1);
  ...
}
```
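The integer arguments in the `cublasDgemmBatched` call follow from the input sizes used in this example. The sketch below reproduces that arithmetic in plain MATLAB, assuming the standard cuBLAS argument order (`m`, `n`, `k`, then the leading dimensions `lda`, `ldb`, `ldc`, and finally the batch count) and column-major storage. The variable names are illustrative and do not appear in the generated code.

```matlab
% Map the example's matrix sizes onto the cublasDgemmBatched arguments.
% All variable names here are illustrative assumptions.
sizeA = [15 42];            % size of A1 and A2
sizeB = [42 30];            % size of B1 and B2

m = sizeA(1);               % rows of A and of D
k = sizeA(2);               % columns of A, rows of B
n = sizeB(2);               % columns of B and of D

% With no transposition and column-major storage, each leading
% dimension equals the row count of the corresponding matrix.
lda = m;
ldb = k;
ldc = m;

batchCount = 2;             % two (A,B) pairs in the batch
bytesB = prod(sizeB) * 8;   % one double occupies 8 bytes

fprintf('m=%d n=%d k=%d lda=%d ldb=%d ldc=%d batch=%d bytesB=%d\n', ...
    m, n, k, lda, ldb, ldc, batchCount, bytesB);
```

The values `15, 30, 42`, the leading dimensions `15`, `42`, `15`, and the trailing `2` match the call in the snippet, and `bytesB` matches the `10080UL` byte count passed to `cudaMemcpy` for `B1` and `B2`.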

## Input Arguments

`A`, `B`: Operands, specified as vectors or matrices. `A` and `B` must be 2-D arrays. The number of columns in `A` must be equal to the number of rows in `B`.

Data Types: `double` | `single` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
Complex Number Support: Yes

### Name-Value Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: ```[D1,D2] = gpucoder.batchedMatrixMultiply(A1,B1,A2,B2,'alpha',0.3,'transpose','CC');```

`'alpha'`: Value of the scalar used for multiplication with `A`. The default value is one.

`'transpose'`: Character vector or string composed of two characters, indicating the operation performed on the matrices `A` and `B` prior to matrix multiplication. Possible values are normal (`'N'`), transposed (`'T'`), or complex conjugate transpose (`'C'`). By default, the matrices are not transposed.
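The effect of the two-character transpose option can be illustrated with a plain MATLAB reference. This is a sketch under two assumptions that the text above does not state explicitly: the first character applies to `A` and the second to `B`, and the option is case-insensitive (the examples on this page use both `'nn'` and `'CC'`). The lookup struct `ops` and the other variable names are hypothetical.

```matlab
% Reference sketch: D = alpha * op(A) * op(B) for a two-character flag.
% Assumption: flag(1) applies to A and flag(2) applies to B.
ops = struct('N', @(X) X, ...     % normal: use the matrix as is
             'T', @(X) X.', ...   % transpose without conjugation
             'C', @(X) X');       % complex conjugate transpose

flag  = 'CT';
alpha = 2;
A = [1+2i 3; 4 5-1i; 6 7];        % 3-by-2, so op(A) = A' is 2-by-3
B = [1 0 2; 0 1i 1];              % 2-by-3, so op(B) = B.' is 3-by-2

opA = ops.(upper(flag(1)));       % operation for A
opB = ops.(upper(flag(2)));       % operation for B

D = alpha * (opA(A) * opB(B));    % 2-by-2 result
disp(size(D))
```

Note that the inner dimensions must agree after the operations are applied, so `op(A)` is `m`-by-`k` and `op(B)` is `k`-by-`n`.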

## Output Arguments

`D`: Product, returned as a scalar, vector, or matrix. Array `D` has the same number of rows as input `A` and the same number of columns as input `B`.

Introduced in R2020a