Design, Evaluate and Implement SoC Hardware and Software Architectures - MATLAB & Simulink

    From the series: Signal Processing and Wireless – Webinar Series

    Overview

    The competing demands of functional innovation, aggressive schedules, and product quality have significantly strained traditional FPGA, ASIC, and SoC development workflows.

    This talk shows how you can use Model-Based Design with MATLAB® and Simulink® for algorithm and system-level design and verification.

    Highlights

    • Verify the functionality of algorithms in the system context and refine algorithms with data types and architectures suitable for FPGA, ASIC, and SoC implementation
    • Simulate hardware and software architectures, including memory, internal/external connectivity, and task scheduling
    • Generate optimized, readable HDL code for the programmable logic and C/C++ code for the processor system, exploring tradeoffs
    • Verify and validate system-level models through co-simulation, SystemVerilog DPI generation, FPGA-in-the-loop, and FPGA data capture options

    About the Presenter

    Kishore Siddani is an application engineer at MathWorks India specializing in the design and implementation of signal processing and communications applications. He works closely with customers across domains to help them adopt MATLAB® and Simulink® in their workflows. Prior to joining MathWorks, Kishore worked for Huawei Technologies, handling telecom clients in around seven countries worldwide. He was also an engineer at Uurmi Solutions, where he was involved in the design and development of a custom OFDM communication system for defense applications. Kishore holds a bachelor’s degree in electronics and communication engineering from Jawaharlal Nehru Technological University Kakinada.

    Recorded: 28 Nov 2020

    I'm Kishore Siddani, an application engineer at MathWorks focusing on wireless communications and FPGA design.

    As a part of this talk, we shall discuss the model-based design approach for SoC development and see how to adopt this approach, taking the development of a Range-Doppler radar system as an example. We shall then walk through the code generation and verification aspects as well.

    For the Q&A part, if you have any questions at any time during the session, please feel free to post them in the Q&A panel. Our panelists will be monitoring the panel and answering them, and I'll also answer the remaining questions at the end of the session.

    Without much ado, let's get started by looking at SoC applications. The system on chip, with its compact architecture, low power consumption, and high-performance capabilities, has made inroads into various crucial applications like 5G and LTE communications, aerospace and defense, and test and measurement. These are serious applications, and speeding up the development process while ensuring zero defects is mission-critical for them.

    Now, let's look at the conventional SoC design workflow. It all starts with the algorithm design, which is done based on a set of functional requirements and system specifications.

    Then the system architects divide the algorithm into hardware and software partitions, manually write the HDL code and C code, and verify the corresponding functionality.

    Then this is integrated with the other peripherals, based on the board's reference designs. And finally, it is prototyped on the hardware.

    In this conventional workflow, hardware and software parts don't come together until the prototyping stage. And that points us to look at the challenges in SoC design.

    Now let's look at the challenges in SoC design. The first one is the decision on the hardware/software architecture itself. How do I decide what runs on the FPGA and what goes onto the processor? Which architecture gives me low latency, and which is less complex to implement?

    Moreover, changing the architecture means complete reprogramming, which again implies longer development cycles.

    The next challenge is the interface between the hardware and software parts. In the conventional flow, we implement hardware and software along separate paths. How do we ensure reliable transfer of data between the FPGA and the processor? And how do we identify processing bottlenecks? These are the challenges.

    The next one is about verification of the design. Verifying the complete design, together with the effects of the hardware and software architectures, is only possible after prototyping on hardware. This leads to late detection of errors, and fixing them can be costly in terms of both time and resources.

    Lastly, collaboration. How can hardware, software, and interface teams all work together effectively for a better design or a better throughput?

    So in order to address these challenges, we need a design methodology that integrates hardware, software, and other components at the system level, that is iteration-friendly, and that offers early verification.

    And the model-based design approach does exactly this. Let's look at the model-based design flow for SoC development. It starts with the system-level specification, wherein we define our inputs, outputs, algorithm, et cetera. Then we look at how we would partition the algorithm, based on analyzing the available options. And then we move into design and implementation.

    You can see here, hardware software, and interface between hardware and software are all shown in parallel, meaning that we are handling all of them in unison.

    And finally, prototyping onto the hardware. Associated with these steps is all-important, integrated testing and verification at each phase. This is the model-based design flow.

    With that introduction to the model-based design workflow for SoC development, let's now look at a target application to be developed using this approach: a Range-Doppler radar.

    Just to give you an overview of what we mean by Range-Doppler here: this is a pulsed radar system for estimating the range and velocity of target objects. Let's pick the ZCU111 RFSoC as the target device for implementing our system.

    And the system specifications for our application are 125 megahertz bandwidth at X-band, which is very much doable with the RFSoC, and a five kilometer max range, which is going to determine things like the PRF. And we want to integrate up to 512 pulses per coherent processing interval. That's related to the velocity estimation accuracy: the more pulses you integrate, the more accurate the velocity estimate.
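    As a quick check of how the max range constrains the PRF (a standard pulsed-radar relation; this calculation is mine, not from the talk):

```python
# The radar cannot send the next pulse until the echo from the farthest
# target of interest has returned, which caps the PRF.
c = 3e8        # speed of light, m/s
r_max = 5e3    # maximum range from the system spec, m

prf_max = c / (2 * r_max)
print(prf_max)  # 30000.0 -> the PRF must stay at or below 30 kHz
```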

    We have the Range-Doppler radar application and its specifications beforehand. So now we are going to put on our system engineer hats for this exercise and start looking at some of the questions that would pop up while investigating the application.

    First, you would be thinking about what the algorithm looks like. What should be implemented in the FPGA, and what should go onto the processor? Which approach gives me the best throughput, and which is the easiest to implement? We are going to find answers to these questions as we go through the design methodology section.

    Back to our exploration of using model-based design for SoC development with the Range-Doppler radar example. Let's start with the system specification.

    So we start with the system-level model, considering the specifications. What I created here is a high-level behavioral model that we will use as a reference for our implementation later on. The key word here is behavioral.

    What this means is that we are not thinking about the implementation in terms of what's going to run on the FPGA and what's in software quite yet. We're just considering the signal processing aspects that need to be modeled in the RF domain and in the DSP domain.

    What I'll do now is I'll open up the Simulink model, and will walk you through the behavioral system model that I created.

    So here is the behavioral model that I built. Let's just run it real quick. You can see here on the left side the Range-Doppler map of the modeled targets. I have two targets. One is at 3,400 meters, moving at 150 meters per second, and the other is at two kilometers, moving at a speed of 100 meters per second in the opposite direction.

    And on the right side is the detection map. This is the output of the CFAR detector, which takes this input and outputs a binary picture of where we think the targets are located. Now let's dive in and see how this model is built and works.

    We start with a waveform block here that generates a linear FM waveform for the specified pulse width and PRF.

    We then send this into the radar target model, wherein we model the targets' position and velocity, together with the free-space channel effects, and then return the signal to the radar receiver.

    So basically, what we are doing here is simulating at a high level how the waveform actually looks in the RF domain.

    And here comes the DSP domain of the system. We do Range-Doppler processing, which involves range gating to eliminate the close-in returns, then buffering the incoming samples and sending them to the Range-Doppler Response block.

    This is a high-level algorithm block from Phased Array System Toolbox that takes a 2-D input and outputs the 2-D Range-Doppler map that we saw earlier.

    And then finally, we send this to a 2-D CFAR detector block that estimates the range and velocity of the targets and plots those as output.
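    The toolbox block does CFAR in two dimensions; as a hedged illustration of the underlying idea only (not the toolbox implementation), here is a minimal 1-D cell-averaging CFAR in NumPy:

```python
import numpy as np

def ca_cfar(x, num_train=8, num_guard=2, scale=4.0):
    """1-D cell-averaging CFAR: compare each cell under test against a
    threshold derived from the mean power of surrounding training cells."""
    n = len(x)
    detections = np.zeros(n, dtype=bool)
    half = num_train // 2 + num_guard
    for i in range(half, n - half):
        # Training cells on both sides, skipping the guard cells around the CUT
        left = x[i - half : i - num_guard]
        right = x[i + num_guard + 1 : i + half + 1]
        noise_est = np.mean(np.concatenate([left, right]))
        detections[i] = x[i] > scale * noise_est
    return detections

# Exponential noise floor with one strong injected return at bin 50
rng = np.random.default_rng(0)
power = rng.exponential(1.0, 128)
power[50] += 40.0
hits = ca_cfar(power)
print(np.flatnonzero(hits))  # includes bin 50, where the injected target is
```

The threshold adapts to the local noise estimate, which is what keeps the false-alarm rate constant as the background level changes.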

    We have done some exploration at the system level. Now let's move into how to partition the algorithm. Let's start with a bit of algorithm analysis here first. We have range samples coming in for an individual pulse over time and over multiple pulse intervals. And along the range dimension, we are performing a matched filter.

    And as we are doing filtering directly on the incoming samples, we can just put our filter in line with the ADC stream, and range can be computed immediately. On the pulse interval dimension, we are doing an FFT for Doppler computation.

    But the thing to note here is that we can't do FFT until we have captured all our samples. That's because we need the entire frame to be present before we can compute the FFT along this dimension.
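    The range-dimension processing described above, a matched filter running in line with the sample stream, can be sketched as follows; the chirp parameters and the delay here are illustrative, not taken from the model:

```python
import numpy as np

fs = 10e6                              # sample rate, Hz (illustrative)
pulse_width = 10e-6                    # pulse width, s
bw = 5e6                               # chirp bandwidth, Hz
n_samp = int(round(fs * pulse_width))  # 100 samples per pulse
t = np.arange(n_samp) / fs

# Linear FM (chirp) pulse: instantaneous frequency sweeps bw over the pulse
tx = np.exp(1j * np.pi * (bw / pulse_width) * t**2)

# Matched filter = time-reversed complex conjugate of the transmitted pulse,
# applied directly to the incoming sample stream
mf = np.conj(tx[::-1])

# Noise-free echo delayed by 40 samples in a longer receive window
delay = 40
rx = np.zeros(400, dtype=complex)
rx[delay:delay + n_samp] = tx

out = np.abs(np.convolve(rx, mf))  # pulse compression
peak = int(np.argmax(out))
print(peak)  # 139 = delay + n_samp - 1: the peak position gives the range
```

Because each output sample depends only on the most recent pulse-length window of inputs, this filter can sit directly on the ADC stream, which is exactly why range is available immediately.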

    That brings us to laying out two options for how we could partition the algorithm. Option A: implement the matched filter and FFT in the FPGA portion, and pass the output of the FFT to the ARM processor for detection.

    And option B: do the matched filter in the FPGA, and send the output of the matched filter to the ARM processor for the FFT and detection operations.

    If we were to compare these two options in the conventional flow, prototyping on hardware first with option A, then completely reprogramming and prototyping with option B, it would significantly slow down the development process.

    So here comes an important question, and what I expect to be one of the key takeaways of this talk: how can I make some of the high-level design decisions before touching the actual hardware? That is, before prototyping the complete system on the hardware, do I have a way to evaluate different design options? As I said, we have two options now.

    So the answer is yes. And let's look at one of the tools available in MATLAB and Simulink that enables us to make these design decisions.

    SoC Blockset is the tool that I was referring to. It was introduced in the R2019a release, and it provides Simulink blocks and visualization tools for modeling, simulating, and analyzing hardware and software architectures for SoCs. You can build and simulate memory and internal/external connectivity, as well as scheduling and OS effects.

    Once you have your architecture, you can quickly explore things like estimating interface complexity for hardware/software partitioning, and evaluating software performance and hardware utilization on the model.

    So let's see how we can use SoC Blockset for analyzing our options. Coming back to option A: matched filter and FFT, both on the FPGA. We saw that to do the FFT, we have to capture the entire data matrix. And thinking about our original system specifications, we need to integrate over 512 pulses. Doing the computation, that's 48 Mb of data that we need to store.
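    The 48 Mb figure is consistent with the CPI dimensions under assumptions the talk doesn't state explicitly; here is a hedged back-of-envelope check (4,096 range bins per pulse and 12-bit I/Q samples are my assumptions, chosen to match the quoted number):

```python
pulses = 512            # pulses per coherent processing interval (from the spec)
range_bins = 4096       # assumed range samples per pulse interval
bits_per_sample = 24    # assumed 12-bit I + 12-bit Q per complex sample

total_bits = pulses * range_bins * bits_per_sample
print(total_bits / 2**20)  # 48.0 megabits, more than the FPGA's block RAM
```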

    If we can store that in the block RAM of the FPGA itself, that would be the simpler implementation.

    But if we look at the ZU28DR device on this board, we see that it has only 38 Mb of block RAM, meaning that we have to store our data in external DDR memory.

    And as we are going off chip, we need to interface with the memory controller. Another thing to note is that we can write the data to DDR in order, as it comes in, but we need to read it back in transposed order to do the FFT along the pulse-interval dimension. So we need to keep that transposed read in mind.

    To estimate the timing for this DDR operation, we can build a hardware model using a couple of blocks from SoC Blockset. Let me build up a model to emulate this pattern and show you.

    So I have a model here with a memory controller block and two traffic generator blocks to emulate the traffic pattern of in-order writes and transposed reads. When I open the memory controller block, you can see the hardware board is listed as ZCU111.

    What that means is that when you target a board with SoC Blockset, you have the advantage of simulating the behavior of the memory controller specific to that board. If you have a special case, you can uncheck this and define the behavior yourself as well.

    After you run the simulation, you can open up and check the performance results. Actually, I ran the simulation already to save some time, so we can go ahead and check the results.

    I am plotting here bandwidth over time for master 1 and master 2. We can see that the bandwidth for the write operation is very high, at 500 megabytes per second, while the bandwidth for the read operation is very low, at around 25 megabytes per second.

    We are transferring the same amount of data in the read and the write operations, but the read is significantly slower: when we read in transposed order, we read one word at a time and can't take advantage of the data sitting contiguously in memory.

    This simulation uses a smaller total data size, just to speed things up, but you get the same picture when simulating the required data size as well.

    So for the complete data size, our DDR operation would require around 438 milliseconds. That's option A: a latency of 438 milliseconds.
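    A rough model shows why option A's latency is dominated by the read side. The frame size and word width below are assumed; the precise 438 ms comes from the SoC Blockset simulation, which captures memory-controller details this sketch ignores:

```python
frame_bytes = 512 * 4096 * 4   # assumed: one 32-bit word per stored sample
write_bw = 500e6               # B/s, in-order burst writes (from the simulation)
read_bw = 25e6                 # B/s, transposed word-by-word reads

write_ms = frame_bytes / write_bw * 1e3
read_ms = frame_bytes / read_bw * 1e3
print(round(write_ms, 1), round(read_ms, 1))
# The transposed read is 20x slower than the write and dominates the total.
```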

    Now let's look at option B, the software-based FFT. The target board's RFSoC has a quad-core ARM processor, so we can distribute the FFT workload across those cores for a four-times speedup compared to running on only one core.

    Looking at the data size, we have to perform a 512-point FFT 4,096 times. So we can split the range bins into groups of 1,024, compute the FFTs for those bins in parallel, and finally integrate them into a large data frame at the end.
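    The split described above can be sketched shape-wise in NumPy (the real implementation would be generated C code running on the ARM cores; this only shows that the partition is numerically exact):

```python
import numpy as np

pulses, range_bins, cores = 512, 4096, 4
rng = np.random.default_rng(0)
data = (rng.standard_normal((range_bins, pulses))
        + 1j * rng.standard_normal((range_bins, pulses)))

# Reference: one 512-point FFT per range bin, along the pulse (slow-time) axis
full = np.fft.fft(data, axis=1)

# Same work split into four groups of 1,024 range bins, one group per core
chunks = np.split(data, cores, axis=0)             # 4 arrays of shape (1024, 512)
partial = [np.fft.fft(c, axis=1) for c in chunks]  # independent, so parallelizable
merged = np.concatenate(partial, axis=0)           # reassemble the full data frame

print(np.allclose(full, merged))  # True: the partition changes nothing numerically
```

Because each range bin's FFT is independent, the four groups have no data dependencies, which is what makes the four-core speedup straightforward.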

    So how do we estimate the timing of that FFT in software? The most accurate way is to just measure it using a profiling tool. We can do that easily using the processor-in-the-loop (PIL) feature of Embedded Coder.

    It's really simple to build the Simulink model for the FFT, generate the C code, run it on the processor, and get the timing results. Doing this, we find that the average execution time for one of these 1024-by-512 frames is about 476 milliseconds. So option B shows a latency of 476 milliseconds.

    Finally, let's compare the two options: FPGA FFT versus software FFT. We saw that the FPGA option had lower latency. But as we said earlier, latency is not the only factor to consider. We saw how easy it was to plug the FFT block into Simulink for implementation on the processor. And since we are only taking a latency hit of tens of milliseconds, we can go ahead with option B as our starting point.
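    Putting the two measured numbers side by side makes the tradeoff concrete:

```python
option_a_ms = 438  # FPGA FFT: dominated by DDR write + transposed read
option_b_ms = 476  # software FFT: PIL-measured execution on the ARM cores

print(option_b_ms - option_a_ms)  # 38: only tens of milliseconds slower,
# for a far simpler implementation on the processor side
```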

    By now, we have decided upon the algorithm partition: what goes onto the FPGA and what goes onto the processor. Now let's see how we can build a system-level architecture integrating hardware, software, and the interface together.

    Again, with SoC Blockset, you can use the existing templates or follow the guidelines to create system-level models for SoC applications, wherein you bring together hardware user logic, the software algorithm, a memory system, and the I/O devices.

    In the context of our example radar application and the implementation of option B, we have the matched filter sitting under the hood of the FPGA here, the FFT and detection algorithm on the processor, and the memory channel between them to model the transfer latency.

    Also, on the processor side, you can model task execution using the PIL timing results that are available to us, so that it emulates the behavior of a real processor on the board. Similarly, you can bring in the I/O device effects as well.

    What we have now is a complete system-level model, together with the algorithm. So you can now simulate and analyze system-level latency, evaluate software-side performance, and keep exploring different architectures.

    What we shall do now is see how to generate HDL code for the FPGA subsystem.

    HDL Coder is the tool that can generate target-independent HDL code from Simulink models and MATLAB functions. By target-independent, I mean that the generated code is not specific to any Xilinx FPGA or Intel FPGA; it's standard code.

    And talking about the verification area: it can generate an HDL test bench, and it can also create a co-simulation environment to co-simulate the generated code with third-party simulator tools.

    As for design automation, it integrates with the synthesis tools, like the Xilinx, Microsemi, or Intel synthesis tools, and automates the entire bit-file generation. We'll take up the same example and see how to generate the code for it.

    Here is the model. In order to generate the HDL code, go to the top-level subsystem that is targeted to the FPGA, right-click, and open HDL Workflow Advisor.

    The HDL Workflow Advisor guides you through the series of steps required for code generation. In the first step, you select the target workflow, Generic ASIC/FPGA, for HDL code generation. Then select the synthesis tool and the target device, as well as the target frequency at which you want to run the model.

    In the second step, prepare the model for HDL code generation: the Workflow Advisor runs through a series of checks of the model's compatibility with HDL code generation.

    In the third step, we have a set of code generation options: the generated language, VHDL or Verilog; what kind of reports I need; the clock settings as part of the advanced options; the different optimization options available; and the test bench options.

    Finally, you can right-click on Generate RTL Code and click Run to Selected Task.

    This opens the code generation report with the generated source files. When you open the code, you can see bit-true, cycle-accurate, editable, readable, and well-commented code here.

    And one of the key features here is the bidirectional traceability between the model and the generated code. If I click on a block, it points to the corresponding code. Similarly, if I click on a part of the code, it points to the corresponding block in the model.

    And here we have the timing and area report: a high-level resource report showing the resource utilization on the device, and a critical path estimation showing the critical path delay. If you click this link, it highlights where the critical path is in the model. Accordingly, you can decide whether you need to add pipeline registers and optimize further.

    We have seen how HDL Coder gives us editable, readable, bit-true, cycle-accurate HDL code. Once we have the code, the next thing that comes to mind is verification.

    So here, we discuss the RTL verification steps. The first one is behavioral co-simulation. What I mean by behavioral co-simulation is that you have a Simulink model and you run it normally, together with the stimulus to it, and take the output of that Simulink model.

    You run as normal, and you also invoke a third-party HDL simulator to run the code that is generated from the same model. So you pass the same stimulus to the HDL simulator that is running the RTL code, and you bring the output from that HDL simulator back into Simulink.

    Then you compare the model output against the output from the HDL simulator. So it's like checking your RTL code against the model as a reference.

    That's one kind of verification you can do.

    The other one is that you can use HDL Verifier to generate SystemVerilog DPI components to help build an RTL verification environment. What I mean is that you can generate C code from the input stimulus and the scoreboard, and then wrap it up in direct programming interface (DPI-C) components.

    These components run in any simulator that supports SystemVerilog DPI; we test with Mentor, Cadence, and Synopsys simulators.

    The advantage is that when you need to make a change to the algorithm, you can quickly regenerate the C code, wrap it up in the DPI component, and use it for verification. This is mainly used in UVM verification methodologies.

    You have the code, and you have verified it. The next thing that pops into mind is integrating with the synthesis tools. I mean, how do you synthesize that code, and how do you generate the bit files?

    Here, within the HDL Workflow Advisor itself, when you select another workflow, you can invoke these third-party synthesis tools in the background: the Workflow Advisor launches them in new shell windows, and all the background processes, like synthesis, place and route, and bit-file generation, can be automated. You end up with a programming file within the HDL Workflow Advisor itself.

    Once we have the bit file, you can do FPGA-in-the-loop verification, wherein you put the algorithm onto the hardware and create a testing environment such that you drive the data source from MATLAB or Simulink, run the algorithm on the input, and bring the output from the hardware board back into Simulink for further analysis.

    That is one of the verification steps. There is also FPGA data capture, wherein you can tap any particular signal from the free-running FPGA, bring it back into MATLAB or Simulink, and use it for further analysis.

    So far, we have seen one side of the implementation and prototyping of the algorithm on the hardware: using HDL Coder to generate VHDL or Verilog to port onto the FPGA or the SoC.

    On the other hand, we have Embedded Coder, which can generate optimized C or C++ code for the processor side of the SoC or for DSPs.

    So together, HDL Coder and Embedded Coder can target both the hardware and the software sides.

    Here is a user story from one of our European customers, Orolia. They adopted model-based design with MATLAB and Simulink and designed a specialized SDR.

    You can see the results as quoted: analysis and testing times were shortened by eight months, and FPGA implementation time was reduced by at least 50%. Let's move ahead.

    Now, let's go to the next stage in model-based design for SoC development: prototyping.

    Once we have the final system-level model with the algorithm fine-tuned to meet the system specifications, the next step is to prototype it on the SoC device. SoC Builder guides you through the steps required to build the hardware and software executables, load them onto the SoC device, and execute the application there on the device.

    So I'll quickly run over the steps that we do as part of final deployment.

    So here is our model. I can go and click on Configure, Build and Deploy to open the SoC builder.

    It opens up and gives us options to select between building the model or loading existing binaries if you have built them earlier. And here it shows the model architecture. You can click Next.

    And then here, you can review the task map and assign the event-driven tasks to the available sources.

    Then click Next. This is where you review the memory map and check that the memory addresses and offsets are correct. If any change is required, you can make it here. Then click Next.

    Here, you specify the folder name where your project files will be generated. There are options again, like build only; build, load, and run; and build and load for external mode. I'll select build only to generate the files for now.

    The next step is Validate. SoC Builder validates the model with all these checks. It's validating.

    And the next one is building. It now builds the software-side application, generates the IP core, and creates the Vivado project, or a project for any third-party synthesis tool.

    It also launches an external shell to call Vivado. So this is the window; we can move it.

    The external shell runs the synthesis, and once the bit-stream generation is complete, it exits Vivado. Finally, we have the files built in our system. Let's see the files: the software files and the Vivado project files.

    So here is the C code file. Oh, let me bring it up. Here is the software application side's C code file that goes onto the processor, with the name that we gave it, soc_prj. And here we can find the Vivado project file.

    That's about prototyping. SoC Blockset supports the boards that are listed here off the shelf. Currently, we are using the ZCU111 RFSoC.

    And if by any chance you are working on a custom hardware board that is not part of this list, there is a defined workflow: select Custom hardware board and add your board's reference designs into the platform, and you can use it as if it were supported. Or if you need support, reach out to us, and we can help with custom hardware boards.

    Just to summarize: we have walked through model-based design for SoC development. We looked at the analysis of the algorithm partitioning options, comparing option A and option B for our example.

    Then we built a system architecture integrating hardware, software, and the interface together, letting us simulate and verify the algorithm together with the system-level effects, so that we can see things like the interface bandwidth effects or task overruns at this stage. And finally, we saw how to generate executables and binaries and deploy them onto the hardware.

    One thing to note is that the tool I have been discussing most throughout this model-based design flow is SoC Blockset, which makes the model-based design approach attainable. SoC Blockset is the tool for designing, evaluating, and implementing SoC hardware and software architectures.

    There are three main aspects of SoC Blockset: it simulates the hardware architecture, deploys automatically onto Xilinx and Intel devices, and profiles the performance on the hardware.

    That brings us here. Here are the resources to help you get started on the variety of topics that we discussed. You can go to the MathWorks website, see further recordings of our webinar videos, and read white papers or articles.

    There are also a plethora of shipping examples that give you complete confidence in the workflows we have discussed. Each of the workflows has shipping examples, so you can try them out.

    If you would like to exploit the full potential of these tools in a more hands-on kind of experience, you can also consider our training offerings. We have a series of interactive web trainings going on at present that cater to quite a variety of topics, like DSP for FPGAs, generating HDL code from Simulink, and programming Xilinx Zynq SoCs. If you are interested in taking these trainings, you can reach out to us, or you can look at our website for more details.

    On top of that, we also offer consulting services to help you accelerate your project development activities and achieve results faster. If you are interested in talking to us about some of your projects, we would be happy to support you. You can also log in to our website and see the different customer success stories we have from consulting, and take some of those learnings to your own projects.