Accelerate AI-based software development on AURIX™ TC4x with Model-Based Design and optimized code
The Infineon AURIX™ TC4x family of microcontrollers is suited for a wide range of automotive and industrial applications. The integrated Parallel Processing Unit (PPU) includes artificial intelligence (AI) capabilities required for next generation control tasks, including real-time motor control. This webinar showcases an example of estimating the rotor position of a motor in real-time with the TC4x. The AI-based software performing the motor control functions can be fully developed using a Model-Based Design approach.
The webinar also describes the tools that have been optimized for TC4x software development: Synopsys’ ARC® MetaWare for AURIX™ TC4x and MathWorks’ Embedded Coder® and SoC Blockset™ Support Packages for AURIX™ TC4x.
This webinar is presented jointly by Infineon, Synopsys, and MathWorks.
Within the workflow for the motor control development,
- Infineon will provide an overview of the architecture of the AURIX™ TC4x family of microcontrollers.
- Synopsys® will provide an overview of the Parallel Processing Unit (PPU), libraries and tooling.
- MathWorks will cover an overview of Model-Based Design and target-optimized automatic code generation.
About the Presenters
Dr. Kajetan Nürnberger, Principal Engineer, Infineon
Before joining the automotive semiconductor world, Dr. Kajetan Nürnberger worked on avionic software in both industrial and academic environments. At Infineon Technologies he started in the software predevelopment team for the AURIX™. He now works in application engineering, enabling customers to use the full compute resources of the TC4x.
Guy Ben Haim, Staff Product Manager, Synopsys®
Guy is a Staff Product Manager with more than 25 years of engineering experience in the semiconductor industry. Guy is responsible for marketing and business development of Synopsys ARC® software products supporting a portfolio of configurable processor IP cores that integrate high-efficiency RISC, DSP, and AI/ML accelerators. Guy drives product vision, definition, and strategy, researches market trends, and works closely with cross-functional teams to deliver high-quality ARC Processor software development products that meet the needs of Synopsys' customers. Guy holds a Master of Engineering in Computer Networks from Toronto Metropolitan University, an MBA from the Technion Israel Institute of Technology, and a B.Sc. from Tel-Aviv University.
Murat Belge, Consulting Engineer, MathWorks
Murat is a consulting engineer at MathWorks. He is the technology team lead for the SoC Blockset Support Package for Infineon AURIX TC4x Microcontrollers. He has worked at MathWorks for over 18 years in various areas, including code generation for embedded systems, networking protocols, middleware representation and integration, IoT, device driver abstraction and integration, targeting embedded Linux systems, embedded vision, and motor control. His experience includes developing embedded vision algorithms for SIMD processors and developing time-domain reflectometry diagnostics systems for digital subscriber lines. Belge received his Ph.D. from Northeastern University, and his M.S. and B.S. from Bilkent University in Turkey, all in electrical engineering.
Recorded: 26 Jun 2023
Hello, and thanks for joining us. Today we'll be discussing how you can accelerate AI-based software development on AURIX TC4x devices with model-based design and optimized code. I'm John Kluza from MathWorks, where I manage ecosystem partners. And I'll be moderating today's webinar.
It will be a joint session with speakers from Infineon, Synopsys, and MathWorks to give you a view from different important perspectives. We'll start with Dr. Kajetan Nürnberger from Infineon, who's a principal engineer on the application engineering team enabling customers to use the full compute resources of the TC4x. He'll give us an overview of the architecture and capabilities of the TC4x device.
Then we'll go to Guy Ben Haim from Synopsys, who is a staff product manager responsible for marketing and business development of the Synopsys ARC software products. He'll explain the Parallel Processing Unit, or PPU, included on the TC4x and its associated software tools.
Our third speaker will be Murat Belge from MathWorks, who's a consulting engineer and the technology team lead for the MathWorks SoC Blockset support package for AURIX TC4x, as well as working in other areas of embedded systems. He'll discuss model-based design and target-optimized code generation and walk through a demo using AI on the PPU of the TC4x to estimate motor rotor position.
And finally, we'll go to a quick wrap up and some time for Q&A. So with that, I'll hand it off to our first presenter. The floor is yours, Kajetan.
Thanks, John, for the introduction. My name is Kajetan Nürnberger and I work for Infineon. Before we dive into the PPU ecosystem, let me introduce to you the AURIX TC4x. This is our latest generation of automotive microcontroller. The AURIX TC4x is motivated by some key trends and challenges in the automotive industry.
First, we have electric mobility, where the complete drivetrain of the car is replaced with an electric one. This brings several new challenges and control loops to the microcontroller.
The most obvious one is the traction inverter, which controls the electric drive motor. But the battery is also added as a new system to the car. This battery needs to be charged, and the charging and discharging processes need to be controlled. In addition, quantities like the state of charge and state of health of the battery need to be maintained. This is also executed on the microcontroller platform.
One other big trend in the automotive industry is autonomous driving and advanced driver assistance systems. Here, we see that functionalities which are available in premium cars, and mostly implemented on microprocessor units, now trickle down to mass-market cars.
This creates significant cost pressure on these functionalities, which you can address by moving from microprocessor units to microcontroller units. Therefore, the microcontroller unit needs a big enhancement in computational power.
One other trend, not necessarily related to new functionality, is the complete restructuring of the electronic car architecture. Here, the main goal is to fuse different ECUs into single controllers. Depending on whether the grouping is done by the functional domain the controller serves or by the physical location where it sits, you speak of either a domain or a zone controller.
Common to both is that several ECUs are fused into one microcontroller, which also raises the compute requirements for that single microcontroller. To address these challenges, we introduced the TC4x.
On this slide, you can see an overview of the AURIX TC4x. Besides a rich set of automotive communication interfaces, we also have some dedicated buses and peripherals to meet the tight execution time requirements you have in the field of xEV applications.
Security is also becoming more and more of a topic in the whole automotive world. Therefore, the cybersecurity functionalities have been enhanced significantly in comparison to the previous generations of the AURIX. But our main focus today is the compute cluster of the AURIX.
Here, we still feature our TriCore. As in the previous generation, we have up to six TriCores in the AURIX TC4x. Besides a frequency increase, the functionality of the TriCore itself has been enhanced so that you get more performance out of it.
In addition to this, we also added the completely new Parallel Processing Unit, which is our main focus today. The Parallel Processing Unit is a so-called SIMD machine, which means that it can apply the same instruction to multiple data elements in the same clock cycle. Such a machine can be used in various applications; some examples you can find on this slide.
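As a rough analogy, the difference between scalar and SIMD execution can be sketched in Python with NumPy. The vectorized call is only an illustration of the idea (the PPU executes native vector instructions, not NumPy), but it shows how one operation can act on all lanes of a vector at once:

```python
import numpy as np

def scalar_dot(a, b):
    """Scalar version: one multiply-accumulate per loop iteration."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

a = np.arange(8, dtype=np.float32)      # [0, 1, ..., 7]
b = np.full(8, 2.0, dtype=np.float32)

# Vectorized version: the multiply-accumulate is applied across all
# eight lanes in one operation, analogous to a SIMD instruction
# acting on a full vector per clock cycle.
vector_result = float(np.dot(a, b))

assert scalar_dot(a, b) == vector_result == 56.0
```

The result is identical; the difference is purely in how many elements each "instruction" touches, which is exactly what the compiler must discover to use the PPU's vector slots.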
We plot them over their execution time bounds and the estimated number of math operations. In addition, this slide groups the applications into complex data processing algorithms, mainly based on linear algebra operations, and applications which are based on neural networks, that is, AI.
On the left-hand side, you have one example from the xEV domain, the onboard charger. Here you have a super-tight execution time requirement, especially when you want to utilize the new features of wide-bandgap semiconductor materials.
On the other end of the execution time scale, you have things like domain control and zone control, where, for example, trajectory planning in the field of ADAS is located. Here you have less system dynamics and therefore somewhat relaxed execution time requirements.
But as the system has higher degrees of freedom, you also have an increased number of mathematical operations. So the compute power still needs to be higher, and the PPU is a good target to tackle those systems.
In this webinar today, we focus somewhere in the middle of this diagram. What we want to show is a combination of the traction motor inverter block and the virtual sensor. I will explain the example we want to demonstrate in more detail later on. But first, let's come to the challenges.
Obviously, a SIMD machine does not come totally for free. The biggest problem here is that you really need to tell the compiler what the vectors and matrices are in order to utilize the hardware efficiently. C, the most common language used for programming automotive microcontrollers, has the problem that it does not support vectors and matrices as dedicated types. Instead, they are abstracted by arrays or pointers.
And here, the compiler often has a hard job ensuring that these are really not overlapping in memory and can really be vectorized. To overcome these challenges, there are many parallel programming paradigms like CUDA, OpenCL, or OpenMP.
The problem with all these models is that they are not really state of the art in the automotive industry. Also, some of them are quite hardware dependent and not easily portable. But there is a method which is accepted in the automotive domain and has knowledge about vectors and matrices.
This is the model-based design approach. Here you have dedicated types for vectors and matrices, and, for example, Embedded Coder can be enhanced to keep this information, carry it into the C code, and generate code which can be efficiently translated by the compiler and deployed to the hardware.
Murat will tell you later how this works in detail. But before that, I would like to introduce the example we are tackling today. Today's example is around field-oriented control for brushless electric motors. A block diagram of this so-called FOC algorithm is shown on this slide.
Besides the pure control loop, which tries to bring the measured value as close as possible to the reference value, this algorithm also contains some transformations before and after the control loop. Some of these transformations require the actual rotor position. This rotor position is normally measured by a sensor attached to the shaft of the motor.
If you now, for example for cost reasons, want to get rid of this position sensor, you can introduce an estimator for the position, as shown on the slide. Here, we apply an AI-based approach for this position estimator: we remove the position sensor and instead introduce an AI-based position estimation.
This position estimation takes as inputs the commanded voltages and the measured currents. With the help of these values, you can estimate the position of the rotor and use it inside your field-oriented control algorithm. Removing the position sensor completely is obviously one of the extreme use cases, but there are also different flavors of this.
For example, you could still keep your position sensor and add the AI-based position estimation on top. Then you could use the measured and the estimated position in a safety function: in case the deviation between these two values exceeds a threshold, you can detect that your system is in a wrong state, and you can bring your complete system into a safe state.
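Such a plausibility check can be sketched as follows (Python for illustration only; the function name and the 0.2 rad threshold are hypothetical, and a production version would run as generated C code in the safety path):

```python
import math

def position_plausible(theta_meas, theta_est, max_dev_rad=0.2):
    """Compare measured and AI-estimated rotor angles.

    The difference is wrapped into [-pi, pi] via atan2 so that angles
    on either side of the +/-pi boundary compare correctly. The
    threshold value is purely illustrative.
    """
    dev = math.atan2(math.sin(theta_meas - theta_est),
                     math.cos(theta_meas - theta_est))
    return abs(dev) < max_dev_rad

# Angles just across the +/-pi wrap point still count as close:
assert position_plausible(3.1, -3.1)
# A large deviation flags the system so it can enter a safe state:
assert not position_plausible(0.0, 1.0)
```

The wrap-around handling matters because the estimator outputs sine and cosine of the angle, so a naive subtraction of angles near ±π would falsely trigger the safety reaction.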
One other use case is that you can use a cheaper position sensor in the field application. You then take the position information from this simple sensor and feed it, in addition to the voltages and currents, into a neural network. The neural network can then calculate an enhanced position estimate, which is later used in the field-oriented control.
Such a use case is not so easily possible with the already known methods for sensorless electric motor control. It is also not so easy with, for example, extended Kalman filter approaches: if you have an existing motor and you do not know its real parameters, those parameters would be required for the system model update within the Kalman filter.
As a neural network is trained on measurements, you would not require these motor parameters, and you could train it even without knowing the concrete parameters of your motor. Now let's dive a little deeper and have a look at how the model is actually built.
So the neural network, as already mentioned, takes as inputs the current pair measured from the motor and the commanded voltage pair, which comes out of the field-oriented control. In order to also capture a little bit of history, we use not only the four signals of one timestamp; instead, we always use the last nine timestamps. This leads to 36 input neurons for the network.
We then build an MLP network with a ReLU activation function, which has six hidden layers, all featuring 36 neurons. Only the output layer reduces to two outputs, which are the sine and cosine of the position; these can later be used in the field-oriented control algorithm.
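As a rough NumPy sketch of the network just described (illustrative only; the real network is trained in TensorFlow and deployed via generated C code, and the placeholder weights below are random, not trained), the forward pass looks like this:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def estimate_position(x, weights, biases):
    """Forward pass of the position-estimation MLP.

    x       : 36-element input (last 9 timestamps x 4 signals).
    weights : six 36x36 hidden-layer matrices plus a 2x36 output matrix.
    biases  : matching bias vectors.
    Returns the rotor angle recovered from the (sine, cosine) outputs.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                      # hidden layers: 36 -> 36, ReLU
    sin_t, cos_t = weights[-1] @ h + biases[-1]  # output layer: 36 -> 2
    return np.arctan2(sin_t, cos_t)              # recover the angle

# Untrained placeholder parameters, just to exercise the shapes:
rng = np.random.default_rng(0)
weights = [rng.standard_normal((36, 36)) * 0.1 for _ in range(6)]
weights.append(rng.standard_normal((2, 36)) * 0.1)
biases = [np.zeros(36) for _ in range(6)] + [np.zeros(2)]

theta = estimate_position(rng.standard_normal(36), weights, biases)
assert -np.pi <= theta <= np.pi
```

Emitting sine and cosine rather than the raw angle avoids the discontinuity at ±π, which is also why the angle is recovered with `arctan2` at the end.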
Now I have introduced to you the TC4x hardware and a nice use case which can run on the AURIX hardware. But making something run on the hardware, with only the hardware at hand, is really a hard job. Therefore, we partnered with Synopsys and MathWorks, who are also presenting in this webinar today.
Synopsys brings the whole compiler infrastructure and a complete SDK for the PPU. And with MathWorks we partner to support model-based design, especially with Embedded Coder and the SoC Blockset. Murat will explain the MathWorks part to you later on. But now, first, let me hand over to Guy, who will tell you more about MetaWare for AURIX.
Thank you, Kajetan. My name is Guy Ben Haim, and I'm the Product Manager for Synopsys ARC software development products. In this part of the webinar, I'm going to talk about the MetaWare toolkit for AURIX TC4x, which is the complete software development kit for the Parallel Processing Unit, the PPU.
The PPU in the AURIX TC4x is a versatile vector DSP architecture addressing a broad range of DSP applications. It features one scalar and three vector slots. The vector width can be 128, 256, or 512 bits. The PPU comes with very efficient floating-point arithmetic, including specialized math accelerators for both single precision and half precision.
It supports 16-bit, 32-bit, and also 8-bit integers, as required for AI workloads. In order to support this hardware, MetaWare for AURIX TC4x was created, along with a rich set of software, tools, and an ecosystem. This toolkit, on the right side of the screen, offers you different components.
A compiler and debugger to generate the binaries for the PPU, the sophisticated nSIM simulator to accelerate software development long before the physical hardware exists, and a rich set of vector libraries to efficiently utilize the vector DSP architecture.
There is also the neural network SDK to efficiently deploy neural networks to the PPU, a speed runtime to schedule tasks for the PPU, an AUTOSAR complex device driver, and a dispatcher for integration with the TriCore to fulfill use cases where the PPU acts as an accelerator.
Of special interest today are the neural network SDK and the vector libraries. To allow coordination between TriCore and PPU execution, there is an integrated programming environment. The purple components in the diagram are from Synopsys.
There is the AUTOSAR Complex Device Driver, the CDD, that manages PPU requests and responses to and from the TriCore. On the PPU side, the software components are: a dispatcher that manages the communication between the PPU and the TriCore; the speed runtime that manages the device operations, task execution, DMA, and messaging via the IPC interface; a neural network runtime for optimized execution of NN models; and a software test library, which is optional and provides a mechanism for ASIL certification when redundant hardware is not present. There is also support for the TC49A board and the TC4x VDK.
One of the important components in the MetaWare toolkit for AURIX is the neural network SDK. It is an easy-to-use neural network SDK for the AURIX TC4x that automatically compiles and maps the AI network to the PPU hardware. It has a robust AI and machine learning software ecosystem supporting the TensorFlow, Keras, and Caffe frameworks, and additional frameworks are supported via ONNX-based import.
There is an end-to-end flow that simplifies compilation and deployment of neural network-based applications to the TC4x. This includes automatic import and model quantization to get efficient performance on the embedded hardware, as well as advanced optimization techniques to map the model for PPU target execution.
Using the flow, it's also possible to test and validate the model accuracy. The flow uses the optimized neural network and machine learning library for the TC4x PPU. And as was shown on previous slides, there is also a simple, unified TriCore service deployment API to the PPU.
There are three approaches to translate neural network models to source code. Number one, you can use the language extensions to enable auto-vectorization by the compiler; that's part of the MetaWare compiler capabilities.
Number two, on this slide, is manual implementation with the PPU-optimized libraries that provide a wide range of functions, such as the standard C library, DSP, and the platform-specific runtime.
Number three is to use the model-based design support, which takes full advantage of automatic code generation for the TC4x PPU. That can be done with a TensorFlow or ONNX model passed to the MetaWare neural network compiler, which outputs optimized C code to compile with MetaWare.
The flow we are focusing on today, in this webinar, starts from the TensorFlow model, goes through MATLAB, Simulink, and Embedded Coder, which generates the code, and then uses the PPU-optimized libraries to create an efficient and optimized execution on the PPU.
So what you see in this flow diagram is the model-based development support flow. This really helps to maximize the application performance on the PPU. What makes this flow even more attractive is the cooperation with MathWorks.
The code generated from Simulink can call the ARC DSP libraries and thus effectively utilize the EV7 hardware. The blue boxes are the MathWorks components, and the purple ones come from Synopsys. Starting from a high-level MATLAB model, the flow uses Embedded Coder to generate code and then replaces it automatically with the highly optimized functions included in the Synopsys MetaWare DSP libraries.
This optimized vector library package offers a huge set of DSP functions to take full advantage of the vector functionality and parallelism of the PPU, for example, to accelerate math operations over multiple data in complex functions. The package also covers the BLAS and LAPACK functionality. This enables the classic software developer to easily move linear algebra-based algorithms to the PPU.
The MetaWare development toolkit compiles the generated vector DSP C code, and this compiled code can run on the PPU target or in simulation, for example on the nSIM simulator, to verify the implementation. Now, Murat will demonstrate to you exactly how this flow works in detail.
Thank you, Guy. Now, I'm going to be talking about model-based design and hardware-optimized automatic code generation for the Infineon AURIX TC4x. First, I'd like to go over model-based design to set the stage. Model-based design is an approach to representing system components and their interactions with their surrounding environment using mathematical models.
There are three key pieces to model-based design. First, model and simulate your system: in this phase, you explore the design space by modeling the system under test and the physical plant. Your entire team can use one multi-domain environment to simulate how all parts of the system behave.
Second, test and verification: reduce expensive prototypes by testing your system under conditions that are otherwise too risky or too time-consuming to consider. Validate your design with hardware-in-the-loop testing and rapid prototyping, and maintain traceability from requirements to design to code.
Third, automatically generate code: instead of writing thousands of lines of code by hand, automatically generate production-quality C code that behaves the same way as the model you created in Simulink, then deploy it directly onto your embedded processor. Simulink and MATLAB act as a platform that enables model-based development.
Now, let's look into the design workflow. You can start with a top-down design approach where you create requirements, develop a system architecture that realizes the requirements, and then design individual components on the Simulink canvas. MathWorks has products, such as System Composer, to help you in the design process.
As we saw earlier in model-based design, testing is involved from the very early stages of design. This helps us answer the following important questions: Does the system meet the design requirements? Is it functioning correctly? And is it completely tested?
Once you are satisfied with your design, generate readable, compact, and fast C/C++ code for your embedded processor. This eliminates costly and error-prone manual coding steps. You can use the bidirectional linking in Simulink models to trace how your model is coded, or from where in the Simulink model a block of code is generated. In the code perspective, you can look at the model and the code side by side, and you can click on any model element to navigate to the corresponding code, and vice versa.
MathWorks provides two hardware support packages for the Infineon AURIX TC4x. These support packages are your gateway into model-based design with AURIX TC4x microcontrollers. If you are only interested in programming the TriCore MCUs on the TC4x processor, you can go with the Embedded Coder support package. If you also plan to include the PPU in your Simulink designs, then use the SoC Blockset support package.
The hardware support package has tight integration with other vendors' tools. The support package contains a wizard that helps you set up the required software tools for your workflow. You can use this wizard to validate installed compilers, instruction set simulators, and so on against the tested versions in the support package. If you're missing a third-party tool, the setup wizard provides instructions on how to install the tool and directs you to the appropriate vendor website if necessary.
The hardware support package contains Simulink blocks representing device peripherals. These are GPIO, Encoder, ADC, PWM, and QSPI as of R2023a. Each peripheral is supported with an extensive set of configuration parameters. Peripheral defaults have been set to enable you to quickly start prototyping.
The support package provides PIL (processor-in-the-loop) capability on hardware or on the Synopsys Virtualizer for rapid test iterations. We typically add new blocks to the hardware support package at each release, so you will see the number of blocks grow in future releases.
The hardware support package for the Infineon AURIX TC4x includes several application examples. Here, in the snapshot, a field-oriented motor control application is shown. You can simulate the complete closed-loop control system, including the plant model representing a PMSM motor, within the Simulink environment. This allows you to refine your design in a simulation environment where you have access to all signals, parameters, and so on.
This motor control example uses a physical encoder to determine the motor shaft position. In the rest of this presentation, I will be concentrating on developing an AI-based position estimator intended to run on the PPU. As Kajetan pointed out in the opening section, an AI-based position estimator can be used to either reduce hardware cost or introduce redundancy to sensor measurements. The PPU is ideally suited for deploying AI networks as it supports SIMD vector instructions.
So how are we going to accomplish this task? You have seen this picture before in Guy's section, where he talked about model-based design with the AURIX TC4x. Here, I will be following a specific path through the tool stacks. We start with a TensorFlow MLP network, which is then imported into MATLAB.
Then we develop a Simulink model representing the AI position estimator, simulate it, and verify the results. We then generate code, compile it with the Synopsys MetaWare compiler, and deploy the generated code to the hardware.
MATLAB has tools to let you import an MLP network designed outside MATLAB. You can use these tools to take a closer look at the MLP network, as seen in this snapshot, and then you can realize the MLP network in Simulink. I'm going to talk about the Simulink model more in the next section.
Here, we see the top-level Simulink model that will be deployed to the PPU. The MLP network developed for motor shaft position estimation uses motor currents and voltages as inputs. We accumulate the current and voltage values in a time-delay buffer of nine samples and feed it to the MLP. The MLP has six fully connected layers with ReLU activation. Each layer has been realized as an atomic subsystem in Simulink, as shown in this snapshot.
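A minimal Python sketch of such a time-delay buffer follows (illustrative names only; the deployed version is generated C code operating on fixed-size arrays):

```python
from collections import deque

class DelayBuffer:
    """Keeps the last n_samples of the four FOC signals (two currents,
    two voltages) and flattens them into the MLP input vector."""

    def __init__(self, n_samples=9, n_signals=4):
        self.n_signals = n_signals
        self.buf = deque(maxlen=n_samples)  # old samples drop off automatically

    def push(self, sample):
        if len(sample) != self.n_signals:
            raise ValueError("expected one value per signal")
        self.buf.append(tuple(sample))

    def ready(self):
        # True once a full window of history has been collected.
        return len(self.buf) == self.buf.maxlen

    def features(self):
        # Oldest sample first; the real model's ordering may differ.
        return [v for sample in self.buf for v in sample]

buf = DelayBuffer()
for k in range(9):
    buf.push((k, k + 1, k + 2, k + 3))   # placeholder (id, iq, vd, vq) values
assert buf.ready() and len(buf.features()) == 36
```

With nine timestamps of four signals each, the flattened feature vector has exactly the 36 elements the MLP's input layer expects.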
You can generate code for the MLP with a few easy steps. Open the Embedded Coder app, go to the hardware settings, and set the hardware board to Infineon AURIX TC4x TriBoard. Ensure that the processing unit is set to PPU. When you select PPU as the processing unit, we automatically enable the code replacement tables developed for the PPU. Here, you see the code replacement tables. Close the hardware settings and click on the Build button.
After a while, you will see the generated code in the code perspective. Note the calls to the MetaWare BLAS and MetaWare VDSP libraries and the calls to SIMD compiler intrinsics at line 68.
This slide compares plain C code with the hardware-optimized code. On the left side of the view is the plain C code obtained by disabling the code replacement (CRL) tables. The code snippet on the right is the hardware-optimized code that used the CRL tables. You see calls to the MetaWare BLAS library for matrix multiplication, a call to the MetaWare VDSP library for vector addition, and calls to SIMD compiler intrinsics in the generated code.
We ran the plain C code and the hardware-optimized code on the MetaWare nSIM simulator and collected profiling statistics. The plain C code takes about 100k instruction cycles, while the hardware-optimized code takes about 5,800 instruction cycles, giving a speedup of roughly 17x for this example.
We are working on generating PPU-optimized code from the deep learning Predict blocks in Deep Learning Toolbox. You will be able to replace the MLP subsystem with a single block while maintaining the same fidelity. This block will generate code optimized for the PPU, for an even greater performance gain than 17x.