Customize Bitstream Configuration to Meet Resource Use Requirements

This example shows how to deploy a digit recognition network with a target performance of 500 frames per second (FPS) to a Xilinx™ ZCU102 ZU4CG device. The target device resource counts are:

  • Digital signal processor (DSP) slice count — 240

  • Block random access memory (BRAM) count — 128

The reference zcu102_int8 bitstream configuration is for a Xilinx ZCU102 ZU9EG device. The default board resource counts are:

  • Digital signal processor (DSP) slice count — 2520

  • Block random access memory (BRAM) count — 912

The default board resource counts exceed the target resource budget, and the default board is on the higher end of the cost spectrum. In this example, you meet both the target performance and the resource budget by quantizing the deep learning network and customizing the bitstream configuration.

Prerequisites

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model Quantization Library

Load Pretrained Network

To load the pretrained series network, which was trained on the Modified National Institute of Standards and Technology (MNIST) database, enter:

snet = getDigitsNetwork;

Quantize Network

To quantize the MNIST-based digits network, enter:

dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA');
Image = imageDatastore('five_28x28.pgm','Labels','five');
calibrate(dlquantObj,Image);

Retrieve zcu102_int8 Bitstream Configuration

To retrieve the zcu102_int8 bitstream configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

referencehPC = dlhdl.ProcessorConfig('Bitstream','zcu102_int8')
referencehPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 64
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 16
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                                   MishLayer: 'off'
                              Multiplication: 'on'
                                    Resize2D: 'on'
                                     Sigmoid: 'off'
                                  SwishLayer: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'int8'
                            UseVendorLibrary: 'off'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 250
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Estimate Network Performance and Resource Utilization for zcu102_int8 Bitstream Configuration

To estimate the performance of the digits series network, use the estimatePerformance method of the dlhdl.ProcessorConfig object. The method returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

estimatePerformance(referencehPC,dlquantObj)
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'imageinput'        Image Input         28×28×1 images with 'zerocenter' normalization                (SW Layer)
     2   'conv_1'            2-D Convolution     8 3×3×1 convolutions with stride [1  1] and padding 'same'    (HW Layer)
     3   'relu_1'            ReLU                ReLU                                                          (HW Layer)
     4   'maxpool_1'         2-D Max Pooling     2×2 max pooling with stride [2  2] and padding [0  0  0  0]   (HW Layer)
     5   'conv_2'            2-D Convolution     16 3×3×8 convolutions with stride [1  1] and padding 'same'   (HW Layer)
     6   'relu_2'            ReLU                ReLU                                                          (HW Layer)
     7   'maxpool_2'         2-D Max Pooling     2×2 max pooling with stride [2  2] and padding [0  0  0  0]   (HW Layer)
     8   'conv_3'            2-D Convolution     32 3×3×16 convolutions with stride [1  1] and padding 'same'  (HW Layer)
     9   'relu_3'            ReLU                ReLU                                                          (HW Layer)
    10   'fc'                Fully Connected     10 fully connected layer                                      (HW Layer)
    11   'softmax'           Softmax             softmax                                                       (SW Layer)
    12   'Output1_softmax'   Regression Output   mean-squared-error                                            (SW Layer)
                                                                                                             
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                      17886                  0.00007                       1              2e+04          13977.0
    conv_1                    4392                  0.00002 
    maxpool_1                 2878                  0.00001 
    conv_2                    2353                  0.00001 
    maxpool_2                 2267                  0.00001 
    conv_3                    2652                  0.00001 
    fc                        3344                  0.00001 
 * The clock frequency of the DL processor is: 250MHz
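
The Frames/s value in the table is consistent with dividing the processor clock frequency by the last-frame latency in cycles, and the network latency equals the sum of the per-layer latencies. The relations below are inferred from the reported numbers rather than taken from the estimator's documentation; the symbols f_clk and N_cycles are labels introduced here for illustration.

```latex
\[
N_{\text{cycles}} = 4392 + 2878 + 2353 + 2267 + 2652 + 3344 = 17886
\]
\[
\text{Frames/s} \approx \frac{f_{\text{clk}}}{N_{\text{cycles}}}
= \frac{250 \times 10^{6}\ \text{Hz}}{17886\ \text{cycles}} \approx 13977
\]
```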

To estimate the resource use of the zcu102_int8 bitstream, use the estimateResources method of the dlhdl.ProcessorConfig object. The method returns the estimated DSP slice and BRAM usage.

estimateResources(referencehPC)
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                804( 32%)        388( 43%)     142494( 52%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

The estimated performance is 13977 FPS and the estimated resource use counts are:

  • Digital signal processor (DSP) slice count - 804

  • Block random access memory (BRAM) count - 388

The estimated DSP slice count and BRAM count exceed the target device resource budget. Customize the bitstream configuration to reduce resource use.
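
The comparison against the budget can be written out as a short check. This is a minimal sketch using only the numbers quoted in this example; the variable names budget, estimate, and overBudget are illustrative and are not part of the toolbox.

```matlab
% Resource budget of the target ZU4CG device (from this example).
budget = struct('DSP',240,'BRAM',128);

% Estimates reported by estimateResources for the zcu102_int8 configuration.
estimate = struct('DSP',804,'BRAM',388);

% Report how far each estimate is from the budget.
fields = fieldnames(budget);
for k = 1:numel(fields)
    name  = fields{k};
    ratio = estimate.(name) / budget.(name);
    fprintf('%s: %d estimated vs %d available (%.2fx the budget)\n', ...
        name, estimate.(name), budget.(name), ratio);
end

% True if either resource exceeds what the target device provides.
overBudget = estimate.DSP > budget.DSP || estimate.BRAM > budget.BRAM;
```

Both resources come out at roughly three times the budget, which is why reducing thread count and memory sizes is the next step.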

Create Custom Bitstream Configuration

To create a custom processor configuration, use the dlhdl.ProcessorConfig class. To learn about the modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

To reduce the resource use for the custom bitstream, set the ProcessorDataType property to int8. Reduce the ConvThreadNumber property of the conv module to lower the DSP slice count. Reduce the InputMemorySize and OutputMemorySize properties of the conv module to lower the BRAM count.

customhPC = dlhdl.ProcessorConfig;
customhPC.ProcessorDataType = 'int8';
customhPC.setModuleProperty('conv','ConvThreadNumber',4);
customhPC.setModuleProperty('conv','InputMemorySize',[30 30 1]);
customhPC.setModuleProperty('conv','OutputMemorySize',[30 30 1]);
customhPC
customhPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 4
                             InputMemorySize: [30 30 1]
                            OutputMemorySize: [30 30 1]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                                   MishLayer: 'off'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                  SwishLayer: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'int8'
                            UseVendorLibrary: 'on'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''
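
To confirm individual settings without reading the full display, the getModuleProperty method referenced earlier can query one property at a time. This is a minimal sketch, assuming customhPC from the previous step is in the workspace and Deep Learning HDL Toolbox is installed; the variable names convThreads and convInputMem are illustrative.

```matlab
% Query single properties of the custom processor configuration.
% Assumes customhPC was created and modified as shown in this example.
convThreads  = getModuleProperty(customhPC,'conv','ConvThreadNumber');
convInputMem = getModuleProperty(customhPC,'conv','InputMemorySize');

% Display the values; this example set them to 4 and [30 30 1].
disp(convThreads)
disp(convInputMem)
```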

Estimate Network Performance and Resource Utilization for Custom Bitstream Configuration

Estimate the performance of the digits series network for the custom bitstream.

estimatePerformance(customhPC,dlquantObj)
### An output layer called 'Output1_softmax' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'imageinput'        Image Input         28×28×1 images with 'zerocenter' normalization                (SW Layer)
     2   'conv_1'            2-D Convolution     8 3×3×1 convolutions with stride [1  1] and padding 'same'    (HW Layer)
     3   'relu_1'            ReLU                ReLU                                                          (HW Layer)
     4   'maxpool_1'         2-D Max Pooling     2×2 max pooling with stride [2  2] and padding [0  0  0  0]   (HW Layer)
     5   'conv_2'            2-D Convolution     16 3×3×8 convolutions with stride [1  1] and padding 'same'   (HW Layer)
     6   'relu_2'            ReLU                ReLU                                                          (HW Layer)
     7   'maxpool_2'         2-D Max Pooling     2×2 max pooling with stride [2  2] and padding [0  0  0  0]   (HW Layer)
     8   'conv_3'            2-D Convolution     32 3×3×16 convolutions with stride [1  1] and padding 'same'  (HW Layer)
     9   'relu_3'            ReLU                ReLU                                                          (HW Layer)
    10   'fc'                Fully Connected     10 fully connected layer                                      (HW Layer)
    11   'softmax'           Softmax             softmax                                                       (SW Layer)
    12   'Output1_softmax'   Regression Output   mean-squared-error                                            (SW Layer)
                                                                                                             
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'Output1_softmax' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     403746                  0.00202                       1             403746            495.4
    conv_1                   26224                  0.00013 
    maxpool_1                31952                  0.00016 
    conv_2                   45120                  0.00023 
    maxpool_2                22409                  0.00011 
    conv_3                  269749                  0.00135 
    fc                        8292                  0.00004 
 * The clock frequency of the DL processor is: 200MHz

Estimate the resource use of the custom bitstream.

estimateResources(customhPC)
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     LUTs(CLB/ALUT)  
                        -------------    -------------    ------------- 
Available                    2520              912           274080
                        -------------    -------------    ------------- 
DL_Processor                192(  8%)        108( 12%)      56270( 21%)
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

The estimated performance is 495 FPS and the estimated resource use counts are:

  • Digital signal processor (DSP) slice count - 192

  • Block random access memory (BRAM) count - 108

The estimated resource use of the customized bitstream fits within the target device resource budget (192 of 240 DSP slices and 108 of 128 BRAMs), and the estimated performance of about 495 FPS is within 1% of the 500 FPS target.
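
The remaining margin can be quantified with a few lines of arithmetic. This is a minimal sketch based only on the figures reported in this example; the variable names are illustrative and nothing here calls the toolbox.

```matlab
% Targets stated at the start of this example.
targetFPS = 500;
budget    = struct('DSP',240,'BRAM',128);

% Estimates reported for the custom configuration.
estFPS   = 495.4;
estimate = struct('DSP',192,'BRAM',108);

% Resources left unused on the target device.
headroomDSP  = budget.DSP  - estimate.DSP;
headroomBRAM = budget.BRAM - estimate.BRAM;

% Relative gap between estimated and target throughput, in percent.
shortfallPct = 100 * (targetFPS - estFPS) / targetFPS;

fprintf('DSP headroom: %d, BRAM headroom: %d\n', headroomDSP, headroomBRAM);
fprintf('Estimated %.1f FPS is %.2f%% below the %d FPS target\n', ...
    estFPS, shortfallPct, targetFPS);
```

The unused DSP slices and BRAMs suggest there may be room to raise a thread count slightly if the small throughput gap matters, although this example does not explore that option.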
