Friday, October 12, 2018

RISC-V SoftCPU Contest Part II

Let's look at the contest requirements in closer detail.

The CPU must pass RV32I compliance tests

As a reference, the following GitHub URL is given:

https://github.com/riscv/riscv-compliance/tree/master/riscv-test-suite/rv32i

There is a Makefile and two subdirectories: src and references. The src directory contains the source code for the tests, and the references directory contains the reference dumps for them. We have no other choice but to assume that we need to pass all the tests, using all the test cases from the src directory. If we now read the riscv-compliance suite documentation a bit higher up in the same GitHub repository, we see that it is required to use the reference signatures:

The only requirement needed in this case is that there must be an option to dump the results from the target in the test environment so as the comparison to test reference signature is possible.

So no matter how we validate our SoftCPU, we must provide a way to dump the signature data from memory after each test runs. We should use ALL the sources from the compliance src directory, compare the dumps with the reference signatures, and get a match in every case. This is the only way to pass the compliance tests.

The good thing is that there is documentation on how to set up the "target" environment for the RISC-V compliance tests: https://github.com/riscv/riscv-compliance/tree/master/doc

Documentation is always good to have, right? So what do we have in this document? Let's search for the word "verilog" - found, there is a topic about the use of Verilator. Very good, so we follow the given steps to set up Verilator as the "target" and we are halfway done setting up the compliance suite environment, right?

No. The section for Verilator has one word as its content: "tbd", nothing more! Similarly, there is a section for one existing hardware target with the same word "tbd" as its full content.

So there is no documentation on how to set up the RISC-V compliance test environment for any RTL simulator or any real hardware target.

Cool, eh? So as part of the contest entry we must implement this without guiding documentation and with no reference examples. The only "targets" included in the official environment are pure instruction set simulators.

What options do we have?

First, we could use Verilator. In that case we need to add support for dumping the signature from memory, which should be doable with some custom C coding. The code would need to figure out where the "dump" region is placed and how large it is, and then write it out in a format compatible with the compliance test reference dumps. Or we write it out in a different format and use some post-processing scripts to convert the dumps to the correct format.
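As a rough illustration of the "custom C coding" part, here is a minimal sketch of a formatter for the signature region. The function name format_signature is hypothetical, and the output layout (lowercase hex, a configurable number of 32-bit words per line, highest-addressed word first within a line) is an assumption that must be checked against the actual reference files. How the testbench finds the start and size of the region is left out.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format `nwords` 32-bit words from the signature region as text.
 * ASSUMPTION: the reference dumps use lowercase hex with `words_per_line`
 * words per line, highest-addressed word first within a line. Verify this
 * against the real reference files before relying on it.
 * Returns the number of characters written (excluding the NUL), or -1 if
 * the arguments are invalid or the output buffer is too small.
 */
int format_signature(const uint32_t *words, size_t nwords,
                     size_t words_per_line, char *out, size_t out_size)
{
    size_t used = 0;

    if (words_per_line == 0 || nwords % words_per_line != 0 || out_size == 0)
        return -1;

    out[0] = '\0';
    for (size_t base = 0; base < nwords; base += words_per_line) {
        /* One line: words in descending address order, then a newline. */
        for (size_t i = words_per_line; i-- > 0;) {
            int n = snprintf(out + used, out_size - used, "%08x",
                             (unsigned)words[base + i]);
            if (n < 0 || (size_t)n >= out_size - used)
                return -1;
            used += (size_t)n;
        }
        if (used + 1 >= out_size)
            return -1;
        out[used++] = '\n';
        out[used] = '\0';
    }
    return (int)used;
}
```

With words_per_line set to 1 the word order question disappears, which may be the easier format to post-process.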

We could also use any other RTL simulator; in that case we would surely need some post-processing scripts. This would also be a valid option, as the rules do not require that the compliance tests be run with Verilator, they just have to be run by some means.
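Whatever produces the dump, the final step is a comparison against the reference signature. The sketch below is one possible way to do that comparison while tolerating cosmetic differences between simulators. The function name signatures_match is hypothetical, and it assumes both dumps list the hex digits in the same order, so it will not fix up a different word order within a line.

```c
#include <ctype.h>

/* Skip spaces, tabs, CR and LF. */
static const char *skip_space(const char *p)
{
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    return p;
}

/*
 * Return 1 if two signature dumps contain the same hex digits in the same
 * order, ignoring whitespace, line breaks and hex-digit case; else 0.
 */
int signatures_match(const char *dump, const char *reference)
{
    for (;;) {
        dump = skip_space(dump);
        reference = skip_space(reference);
        if (*dump == '\0' || *reference == '\0')
            return *dump == '\0' && *reference == '\0';
        if (tolower((unsigned char)*dump) != tolower((unsigned char)*reference))
            return 0;
        dump++;
        reference++;
    }
}
```

A stricter line-by-line comparison would catch more formatting mistakes, so this lenient version is only a starting point.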

We could also validate on a real FPGA. In that case we would need to dump the RAM to the console after each test and log it for later comparison. This would also be a valid compliance test.

Which path we take depends on our experience, skills and mood, I guess. Using a real FPGA would be the most time consuming, as it would require the FPGA board to be loaded with each test image in turn and all the data logged over a serial port, not really fun. The Verilator or RTL simulator based approaches are faster and require less manual work, as the tests would run in one batch. And I would not envy the judges if they have to use FPGA JTAG programming and UART console logging to verify all the contest entries.

Could we just implement the ASSERT IO macros and forget the signature dumps? Unfortunately no, the compliance documentation does not allow this method, at least not yet. So if we do not do the signature dumps, the contest judges may disqualify our entry as non-compliant.

(Must) be possible to be simulated using Verilator

This is a hard requirement by the rules. The rules, however, do not say that we have to simulate the CPU with Verilator or provide any scripts or testbenches for Verilator. As long as we use plain Verilog files we should be fine, right? But how would the judges verify our claim that Verilator simulation is possible? Would the judges create the required setup for Verilator and verify our SoftCPU in the time they have for it? The deadline is 23:59 on 26th November and the winners are announced on 3rd December!

No. No way the judges would have time to do that, would they? Actually, they have to verify the claims for the competition to be fair. If the judges did not test for it, and there was no documentation or proof in the contest entry either, the winning entry might not simulate with Verilator at all. The wrong entry would then win, and it could be you who loses.

I would say that, to be safe, you should provide a Verilator testbench/script and documentation for it as part of your entry.

But "test coverage" is another thing; it is not mentioned in the rules at all. So it would be perfectly legal to wire the instruction memory read bus to 0x63000000 and start Verilator with the CPU core. It would be a simulation, fairly minimal but still a valid simulation.
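To see why a constant on the instruction bus still counts as a running program, it helps to decode it. My reading, which is an assumption, is that the value is meant as the byte sequence 63 00 00 00, that is the little-endian word 0x00000063, which is beq x0, x0, 0: a branch to itself. The sketch below decodes the B-type fields; the helper names are hypothetical.

```c
#include <stdint.h>

/* Fields of a RISC-V B-type (conditional branch) instruction. */
struct btype {
    uint32_t opcode;  /* bits 6:0   */
    uint32_t funct3;  /* bits 14:12 */
    uint32_t rs1;     /* bits 19:15 */
    uint32_t rs2;     /* bits 24:20 */
    int32_t  offset;  /* sign-extended branch offset in bytes */
};

struct btype decode_btype(uint32_t insn)
{
    struct btype d;
    uint32_t imm;

    d.opcode = insn & 0x7f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1 = (insn >> 15) & 0x1f;
    d.rs2 = (insn >> 20) & 0x1f;

    /* imm[12|10:5] sit in bits 31:25, imm[4:1|11] in bits 11:7. */
    imm = ((insn >> 31) & 0x1) << 12
        | ((insn >> 7) & 0x1) << 11
        | ((insn >> 25) & 0x3f) << 5
        | ((insn >> 8) & 0xf) << 1;
    /* Sign-extend from bit 12. */
    d.offset = (imm & 0x1000) ? (int32_t)imm - 0x2000 : (int32_t)imm;
    return d;
}

/* A branch-to-self: BEQ (funct3 0) with equal source registers, offset 0. */
int is_branch_to_self(uint32_t insn)
{
    struct btype d = decode_btype(insn);
    return d.opcode == 0x63 && d.funct3 == 0 && d.rs1 == d.rs2 && d.offset == 0;
}
```

Read as a 32-bit word, 0x63000000 has zero in its opcode field and is not a branch, so the byte order on the bus matters even for this one-instruction "program".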

There is, however, another problem - to win the contest we are pretty much forced to use FPGA architecture hard IP primitives directly, using vendor libraries or the vendor IP core generator. For those hard IP blocks we do not have Verilog simulation code. So in order to make the SoftCPU Verilator friendly, we need to provide pure Verilog simulation code to be used in place of those hard IP blocks.

Which brings us to the next problem - if we validate in simulation with Verilator or an RTL simulator, we are forced to use a Verilog-only version of our SoftCPU (one that replaces the hard macro IP blocks with Verilog). If that simulation-only code behaves differently from the real hard IP blocks, then our validation is invalid - the CPU would pass compliance in simulation but not in the FPGA.

So to be really, really safe we should run the compliance tests on a real FPGA, because Verilator would not use the same code base as the FPGA-technology-optimized code for the SoC implementation.

You say we SHOULD write in FPGA vendor neutral Verilog? This just is not possible. One example would be Microsemi targets - there we would surely need to use eNVM and/or eSRAM for ROM/RAM storage, but those resources are only accessible via Libero SmartDesign and are exposed as a black box with an AHBlite interface, with no BSD licensed simulation Verilog available.

Sure, if we provide an AHBlite RAM model for Verilator and use the SmartDesign based eSRAM hard IP block in the FPGA design, the judges would not disqualify us if we only provide compliance testing in simulation.

But if we use, say, Math/DSP blocks in an "enhanced" way in Microsemi and/or Lattice iCE+ designs, the issue is way more complex. What if the "verilog" model we provide to "mimic" the FPGA vendor's hard IP block is not correct?

Conclusion: to be safe we should run all compliance tests in simulation (Verilator preferred) and also on real hardware (at least if we use vendor IP blocks directly).

Dhrystone

From the rules: performance will be measured with the Dhrystone benchmark (from the riscv GitHub!) compiled with the -O3 -fno-inline options. We should assume that we must run those source files from the referenced GitHub location without modifications, right?

The main C file dhrystone_main.c prints out the following as its result:

printf("Microseconds for one run through Dhrystone: %ld\n", Microseconds);
printf("Dhrystones per Second: %ld\n", Dhrystones_Per_Second);

So we get a single metric as the result - Dhrystones per Second - and we must assume that this is the only result used in the performance scoring. It is not Dhrystones/MHz - no, it is Dhrystones per Second. This means that the contest is not for a SoftCPU but for an FPGA SoC implementation, as the Dhrystone result is highly dependent on the memory subsystem performance in the FPGA and the maximum reachable CPU clock frequency.
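To make the dependence on clock frequency concrete, here is a minimal sketch of how such a score relates to a cycle count. The function name and parameters are my own for illustration and are not taken from dhrystone_main.c.

```c
#include <stdint.h>

/*
 * Dhrystones per second derived from a cycle count.
 *   cycles   : CPU cycles spent in the measured loop
 *   runs     : number of Dhrystone iterations executed in that loop
 *   clock_hz : CPU clock frequency in Hz
 * Returns 0 if cycles is 0.
 */
uint64_t dhrystones_per_second(uint64_t cycles, uint64_t runs, uint64_t clock_hz)
{
    if (cycles == 0)
        return 0;
    return (runs * clock_hz) / cycles;
}
```

For example, a core that needs 500 cycles per run scores 100000 at 50 MHz and 200000 at 100 MHz. The per-MHz figure is identical in both cases, but only the absolute number shows up in the printout.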

So what we need to optimize for speed are:
  1. SoftCPU performance tuned for Dhrystone benchmark only
  2. Memory subsystem performance
  3. Bus structure performance
  4. FPGA timings tuning to reach higher clock for our SoC design

What about overclocking? How much overclocking is allowed? We could even say that we need to increase the IGLOO2 core voltage to 1.25V and cool it to -40C; this would improve the FPGA timings a lot. OK, let's forget overclocking (but it would not actually violate the rules).

A good read on Dhrystone is the EEMBC whitepaper about it. Well, most of the ways to fake the results cannot be used in this contest. Wait, Zephyr GCC is required for Zephyr RTOS, but the rules do not actually say which compiler should be used for the Dhrystone test. So if we get very technical, we could use an "optimized" compiler to improve our Dhrystone result. Well, I guess we would get disqualified, but by the rules it would be valid.

What about the -DREG option? The rules say nothing about it. There are actually many more parts of the benchmark build that are not clear. In the official riscv GitHub repository the command line for the performance tests is:

-DPREALLOCATE=1 -mcmodel=medany -static -std=gnu99 -O2 -ffast-math -fno-common -fno-builtin-printf

The contest requires -O3, so we must assume that the Makefile from the official riscv benchmark repository should not be used. What about other files from the official repository? The Dhrystone C file that we assume we MUST use includes "util.h". This file is located in:

riscv-tests/benchmarks/common

Are we required to use this util.h file? Or can we modify the benchmark source C code? This include file also pulls in encode.h and declares:

extern void setStats(int enable);

This function is defined in /benchmarks/common/syscalls.c - if we look at that file, it is clearly made to be used only in instruction set simulators, not in RTL simulation or in FPGA benchmarking.

So what can we do? Should we modify the benchmark source code? Better not, but then we would need to provide our own "util.h" in place of the one from the riscv GitHub.
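One way such a replacement could look is sketched below: a setStats with the same prototype that records elapsed cycles from a free-running counter. Everything except the setStats prototype is hypothetical, including set_cycle_register and get_elapsed_cycles. On real hardware the pointer would be the address of a memory-mapped timer or be replaced by a CSR read; here it can point at a plain variable so the sketch also runs on a host.

```c
#include <stdint.h>

/* HYPOTHETICAL: address of a free-running 64-bit cycle counter. */
static volatile uint64_t *cycle_reg;

static uint64_t start_cycles;
static uint64_t elapsed_cycles;

/* Tell the stats code where the cycle counter lives. */
void set_cycle_register(volatile uint64_t *reg)
{
    cycle_reg = reg;
}

/* Same prototype as the declaration in util.h. */
void setStats(int enable)
{
    if (cycle_reg == 0)
        return;                              /* no counter configured */
    if (enable)
        start_cycles = *cycle_reg;           /* measurement starts */
    else
        elapsed_cycles = *cycle_reg - start_cycles;  /* measurement ends */
}

/* Cycles between the last setStats(1) and setStats(0) pair. */
uint64_t get_elapsed_cycles(void)
{
    return elapsed_cycles;
}
```

Whether the judges would accept a replacement like this is exactly the open question, so treat it as an experiment rather than a sanctioned approach.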

Time - we need a real-time timer as well, or the benchmarking would not make sense. But hey, we could accidentally have the timer run at, say, a 5% wrong clock? It could improve our score by 5%?
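The effect of a mis-clocked timer is simple arithmetic, sketched here with a hypothetical helper. It assumes the score is computed as runs divided by measured time, so a timer that ticks slowly under-measures the time and inflates the score.

```c
/*
 * Score reported when the timer runs at `timer_ratio` times its nominal
 * rate (1.0 = exact, 0.95 = 5% slow). A slow timer under-measures the
 * elapsed time, so the reported score is the true score / timer_ratio.
 */
double reported_dhrystones_per_second(double true_score, double timer_ratio)
{
    return true_score / timer_ratio;
}
```

With a timer that is 5% slow, a true score of 1000 would be reported as roughly 1052.6, so the gain is slightly more than 5%.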

What about the "Smallest" category, do we have to provide Dhrystone capability or not? Dhrystone requires some sort of real-time clock; for the smallest category we could omit that, as Dhrystone would not be used to score it. But maybe an entry that only targets the "smallest" category is still required to be able to run Dhrystone? Not clear. To be safe, we should make sure we can run the Dhrystone test even if we clearly target only the "smallest" category.

GETTING MAD, 32 bits at a time..

To understand the story with Dhrystone for RISC-V in the context of the RISC-V SoftCPU contest, I tried it out. Here it goes. Toolchain? Let's take the official one, and a pre-compiled one, to be sure that it is correctly configured and compiled, so from here:

https://gnu-mcu-eclipse.github.io/toolchain/riscv/

It clearly says that this page provides the correct multilib toolchain for embedded (non-Linux) targets. Absolutely everything says this must be the correct toolchain to use.

The contest targets RV32I, so I set up a build script for Dhrystone using -march=rv32i -mabi=ilp32. I am using unmodified files from the riscv GitHub; this is how far I get:

undefined reference to __umoddi3
undefined reference to __mulsi3
undefined reference to __divsi3

The errors come from syscalls.c, from dhrystone.c and from dhrystone_main.c - from all the included C files!

A quick Google search says that these errors happen when targeting 32-bit RISC-V with a toolchain that is incorrectly configured - multilib option not enabled. What? The very web page where I got the toolchain says it is a "multilib" toolchain. Does it mean multilib in some other context?
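Some background on those symbols: names like __mulsi3 and __divsi3 are libgcc helper routines that the compiler calls when the target has no hardware multiply or divide, which is the case for plain RV32I. The sketch below shows roughly what a software 32-bit multiply has to do using only shifts, adds and branches. It is an illustration under the hypothetical name soft_mulsi3, not the libgcc source.

```c
#include <stdint.h>

/*
 * Shift-and-add multiply using only operations available in RV32I
 * (add, shift, and, branch). Named soft_mulsi3 to avoid clashing with
 * the real libgcc symbol __mulsi3.
 */
uint32_t soft_mulsi3(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0) {
        if (b & 1)
            result += a;   /* add the shifted multiplicand for each set bit */
        a <<= 1;
        b >>= 1;
    }
    return result;         /* low 32 bits of the product */
}
```

A loop like this can take up to 32 iterations per multiply, which gives a feel for what a benchmark pays when the M extension is missing.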

What about -march=rv32im ?

Wah - the errors from dhrystone.c and dhrystone_main.c disappeared, only a few errors from syscalls.c remained!

Now this is important - this simple test clearly shows that if we are competing in the highest performance category we must implement RV32IM. This is not an option, it is pretty much a requirement (well, on the Microsemi platform at least).

Dhrystone uses the strcmp function once, and it is implemented in syscalls.c - if we manage to optimize it even a little we have gained some benefit. Or we could just return the correct result without performing the comparison - this would be faking, of course. But if we do not optimize strcmp, maybe our competitor does and uses that performance boost to win? The status of the syscalls file is not really clear. I would assume that it is OK to modify it, say those:

extern volatile uint64_t tohost;
extern volatile uint64_t fromhost;

These are in syscalls.c to TALK to the instruction set simulator. We are, however, not running the tests in an instruction set simulator, so we pretty much should modify syscalls.c to match our embedded FPGA SoC? I assume we can do that without violating the contest rules. OTOH, it may also be possible to modify the SoftCPU FPGA SoC in such a way that the syscalls from the riscv GitHub could be used without modifications?

Are you confused? I truly am.

to be continued...
