Wednesday, October 10, 2018

RISC-V SoftCPU Contest

This contest was initially launched at ORCONF 2018 in Gdansk and is now officially hosted at riscv.org - RISC-V SoftCPU Contest.

The big question: what is evaluated in the contest, a SoftCPU core or an SoC based on the SoftCPU core?
From the rules:

The entries will be RV32I-compliant soft CPU's.

But it is also clear that the SoftCPU core by itself cannot pass the minimum requirements, as the entry has to be a complete FPGA implementation and it must run the Philosophers and Synchronization examples of Zephyr RTOS. So to pass the requirements we must have a minimal SoC that uses our SoftCPU, while for the resource utilization we should only count the resources used by the SoftCPU, right? The SoC system bus and peripheral components are for sure not part of the CPU.

Requirements:

  1. RV32I-Compliant - exact documents not specified!
  2. For performance category Dhrystone is used
  3. Must run Zephyr 1.13 version keeping RTOS core (not specified) untouched
  4. Must pass (assumed ALL) RV32I compliance tests
  5. Must boot Philosophers and Synchronization (but they may fail?)
  6. Complete FPGA design for IGLOO2, SmartFusion2 or iCE40 UltraPlus
  7. Must use unmodified GCC toolchain provided by Zephyr
  8. Must use Verilog and must be possible to simulate with Verilator
  9. Must include binary version of the bitstream and instructions to build it

Let's look at the requirements. From the RISC-V website we can get two ISMs (Instruction Set Manuals): one for the user-level ISA, version 2.2, and one for the privileged ISA, version 1.10. We must assume that compliance for RV32I can be figured out by studying those documents.

There is no way those documents describe the RV32I requirements cleanly. First, there are separate user and privileged ISM documents. We absolutely have to follow the user ISM, that much is sure. But what about the privileged ISM? If we want to (and we have to) pass the "compliance testing" in full, we must implement part of the privileged stuff; there is no way around it. But it is not defined which parts - so what can we do?

We look at the "compliance test" suite, list all the features that are exercised there, and implement them using both ISM documents.

Some examples: the privileged ISM describes the WFI instruction, which is not present in the user ISM. As this instruction is not tested by the compliance suite, we do not need to implement it. But MRET, also from the privileged ISM, we do need to implement, as it is used in the compliance suite.
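To make the "implement only what the suite exercises" idea concrete, here is a minimal Python sketch. The MRET and WFI encodings are the fixed SYSTEM-opcode encodings from the privileged ISM; the set `USED_BY_COMPLIANCE_SUITE` and the function names are hypothetical, and the real list would have to come from actually walking through the test sources.

```python
# Fixed encodings of two privileged instructions (both use the SYSTEM opcode).
MRET = 0x30200073
WFI = 0x10500073
OPCODE_SYSTEM = 0x73

# Assumption for illustration: mnemonics we found exercised by the compliance suite.
USED_BY_COMPLIANCE_SUITE = {"mret"}


def decode_system(instr):
    """Return a mnemonic for the few fixed SYSTEM encodings handled here."""
    if instr & 0x7F != OPCODE_SYSTEM:
        return "not-system"
    if instr == MRET:
        return "mret"
    if instr == WFI:
        return "wfi"
    return "other-system"


def must_implement(instr):
    """True if the instruction is on our (hypothetical) must-implement list."""
    return decode_system(instr) in USED_BY_COMPLIANCE_SUITE


for word in (MRET, WFI):
    print(hex(word), decode_system(word), must_implement(word))
```

Under these assumptions MRET lands on the must-implement list while WFI can simply be left out of the decoder.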

It is an equally bad story with the CSRs - the user ISM describes three 64-bit CSRs as part of RV32I, but the compliance suite uses many more, and there are many more described in the privileged ISM. Which ones do we need to implement to be RV32I compliant?

There is no definitive answer to those issues.

Let's take the mtval CSR: it is described in the privileged ISM and it is also partially tested in the compliance suite, but referenced there by the old name mbadaddr from an outdated ISM document. So mtval should contain the instruction code on an illegal instruction trap, but as this is not checked we do not need to implement this, right? We need to implement mtval only to the extent the compliance suite is testing it.

Another issue - what about register widths? In some cases some registers would always contain zeros in the leading bits on a given system; can we hard-wire them to 0? As an example, we can limit the external memory region to only 1 Mbyte - on a small SoC in a small FPGA all the available memory and peripherals would fit into that memory space. In this case the program counter would not need the full 32 bits, would it? Same for some other registers. As those bits would always be 0, could they be implemented as constant 0? Or would that be considered a violation of the ISM document?
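The arithmetic behind the 1 Mbyte example can be sketched in a few lines of Python. The function names are made up for illustration, and whether such truncation is actually allowed is exactly the open question above.

```python
def address_bits(region_bytes):
    """Number of address bits needed to cover a power-of-two sized region."""
    return (region_bytes - 1).bit_length()


def truncate_pc(pc, region_bytes):
    """Model a program counter whose upper bits are hard-wired to 0."""
    return pc & (region_bytes - 1)


ONE_MBYTE = 1 << 20
bits_needed = address_bits(ONE_MBYTE)   # 20 bits cover 1 Mbyte
bits_saved = 32 - bits_needed           # 12 PC bits could be constant 0

print(bits_needed, bits_saved)
print(hex(truncate_pc(0x00100004, ONE_MBYTE)))  # address wraps inside the region
```

So with a 1 Mbyte space the program counter needs 20 bits, and 12 flip-flops (plus the matching adder bits) become constants.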

Very very complicated - the requirements "RV32I" are not clear at all.

The user ISM document says that three 64-bit CSRs are mandatory (counters); however, for low-end implementations the ISM allows the upper 32 bits to be implemented in software! But no matter whether 32 or 64 bits, we need 3 counters that are read-only. Counters can be implemented using DSP blocks, which would save plenty of FPGA logic elements. So if we want to minimize logic use, we should use the FPGA's hard DSP IP blocks for those counters. Because if we do not do this, our competitor will. It also seems that we can combine the RDCYCLE and RDTIME counters into one counter; there is nothing in any of the documents that says it is not valid to do so.
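A rough Python model of the "upper 32 bits in software" option: only the low word is hardware, and an overflow trap handler keeps the high word in memory. The class and method names are hypothetical; this is a sketch of the idea, not of any particular core.

```python
MASK32 = 0xFFFFFFFF


class SplitCounter:
    """32-bit hardware counter; the upper 32 bits are kept by software."""

    def __init__(self, low=0):
        self.low = low & MASK32   # the only part that costs FPGA resources
        self.high_sw = 0          # lives in memory, maintained by a trap handler

    def tick(self):
        self.low = (self.low + 1) & MASK32
        if self.low == 0:         # wrap-around: hardware would raise a trap here
            self.on_overflow_trap()

    def on_overflow_trap(self):
        self.high_sw = (self.high_sw + 1) & MASK32

    def read64(self):
        return (self.high_sw << 32) | self.low


counter = SplitCounter(low=0xFFFFFFFE)
for _ in range(3):
    counter.tick()
print(hex(counter.read64()))
```

If RDCYCLE and RDTIME really may share one counter, the same object would simply back both CSR addresses.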

OK, let's try to figure out the minimum set of CSRs needed to satisfy the contest rules.

First - all unimplemented CSRs must be readable and return 0 (like misa and many more).
mtvec - can be hard-wired and read-only; for optimization a vector with the fewest 1 bits would be best
mscratch - needed, used in the compliance tests
mtval - needed, used in the compliance tests (but only partial functions are tested..)
mepc - needed (lower 2 bits should be 0)
mcause - needed, but the number of exception codes supported is not clear
mip/mie - not clear if needed, as interrupt handling is not mandatory
mstatus - not clear if needed; the Zephyr core does save/restore it, but does not use it directly
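The list above can be modelled as a tiny CSR file. The CSR addresses are the standard machine-mode numbers from the privileged ISM; the hard-wired `MTVEC_FIXED` value and the class itself are assumptions for illustration, and mip/mie/mstatus are deliberately left unimplemented here because their necessity is unclear.

```python
# Standard machine-mode CSR addresses.
CSR_MSTATUS, CSR_MISA, CSR_MIE, CSR_MTVEC = 0x300, 0x301, 0x304, 0x305
CSR_MSCRATCH, CSR_MEPC, CSR_MCAUSE, CSR_MTVAL, CSR_MIP = 0x340, 0x341, 0x342, 0x343, 0x344

# Hypothetical hard-wired trap vector with a single 1 bit.
MTVEC_FIXED = 0x00000010


class MinimalCsrFile:
    """Sketch of the smallest CSR set discussed above."""

    def __init__(self):
        # Only these four are backed by real storage.
        self.regs = {CSR_MSCRATCH: 0, CSR_MEPC: 0, CSR_MCAUSE: 0, CSR_MTVAL: 0}

    def read(self, addr):
        if addr == CSR_MTVEC:
            return MTVEC_FIXED            # read-only constant
        return self.regs.get(addr, 0)     # unimplemented CSRs (e.g. misa) read as 0

    def write(self, addr, value):
        value &= 0xFFFFFFFF
        if addr == CSR_MEPC:
            value &= 0xFFFFFFFC           # lower 2 bits of mepc are always 0
        if addr in self.regs:
            self.regs[addr] = value
        # Writes to mtvec and to unimplemented CSRs are silently ignored.


csrs = MinimalCsrFile()
csrs.write(CSR_MEPC, 0x1237)
print(hex(csrs.read(CSR_MEPC)), hex(csrs.read(CSR_MTVEC)), csrs.read(CSR_MISA))
```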

For a minimal system supporting the simplest RTOS we need at least a "tick" interrupt. This however could be implemented as an external "tick" triggering an NMI, with special logic to enable it (disabled on power-on reset) - this would allow us to exclude all interrupt-related logic from the RV32I core.

OK, this is now really complicated. From the rules: the Zephyr "core" should not be modified; it is however not specified what is considered part of the Zephyr core.

Maybe it refers to the riscv32 core arch?
https://github.com/zephyrproject-rtos/zephyr/tree/master/arch/riscv32/core
Or does it refer to some undefined set of files from undefined selected directories?

In any case it is not clear what the minimal required support for the systick/timer is for Zephyr (in such a way that the "core" is not modified).

In short: the requirements for the RV32I are not clear.

Now let's look at the grading for the smallest implementation: FPGA resource utilization. How is this calculated?

There is no weighting of DSP and RAM vs LE (logic elements). Smallest number of total resources!? This can only be read as: every resource counts as 1 towards the total count. So each DSP and RAM instance has the same weight as a logic element. Now, logic elements include a LUT and a flip-flop - sometimes only the logic is used, sometimes only the flip-flop, and sometimes both. Scoring does not account for that, so if more of our logic elements use both the LUT and the flip-flop, we win..
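Read literally, the scoring is just a sum. The Python sketch below shows what that implies; the design sizes and the "about 32 LE for a 32-bit adder" figure are hypothetical numbers picked only for illustration.

```python
def total_resources(logic_elements, dsp_blocks, ram_blocks):
    """Unweighted score: every LE, DSP block and RAM block counts as 1."""
    return logic_elements + dsp_blocks + ram_blocks


# Hypothetical design with one 32-bit adder built from roughly 32 logic elements...
adder_in_fabric = total_resources(logic_elements=1000, dsp_blocks=0, ram_blocks=2)

# ...and the same design with that adder moved into a single math block.
adder_in_dsp = total_resources(logic_elements=1000 - 32, dsp_blocks=1, ram_blocks=2)

print(adder_in_fabric, adder_in_dsp)
```

With no weights, trading a few dozen logic elements for one DSP block always lowers the score, however large that DSP block is in silicon.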

Based on the above to win in the smallest category, we should:
  1. Implement all adders and counters using math blocks.
  2. Implement register file using minimal number of RAM blocks
  3. Push as much as possible from the SoftCPU to the SoC subsystem

And now let's make it more complicated: the rules say that on the SmartFusion2 we should not use the hard CPU subsystem as part of the SoftCPU, but hey, there is 256K of eNVM in IGLOO2; this resource is not part of a hard CPU and it is not a logic element, RAM or math block. So if we use that resource as a ROM lookup or microcode storage, it would not count towards the resource utilization. Interesting, huh? Also interesting is the note that "interesting ways" to enhance the design using the hard CPU subsystem can be implemented - do we get special points for this too? No idea, possibly not.

Hmm.. "Hard CPU subsystem should not be used such" - but what about eSRAM? This gets really complicated. Let's say we implement the SoftCPU on IGLOO2 and use the eSRAM as the register file; this is, well, kinda stupid, but it would not violate the rules! On IGLOO2 eSRAM is counted as part of the FPGA and if used would increase the utilization of the SoftCPU. But let's move the same design to SmartFusion2: now the same resource, eSRAM, is part of the MSS and not counted as an FPGA RAM resource - so what now, is our design now against the rules, or should we "not count" eSRAM as it is not part of the FPGA fabric? Complicated.

Again, in short: the way of calculating the "resources" is not clear at all.

TIP 1: expose as low-level and simple an interface to the external world as possible. Push all the "bus adapter" and bus multiplexer code out of the SoftCPU core, and make this external bus as narrow as possible, say only 1 Mbyte of address space in total; this saves some logic too.

Target Devices with resources:

IGLOO2 M2GL025
LE (4LUT+FF): 27696
DSP 18x18: 34
eNVM: 256KB
Total RAM: 1104 Kbits includes eSRAM

SmartFusion2 M2S025
LE (4LUT+FF): 27696
DSP 18x18: 34
eNVM: 256KB - part of MSS
Total RAM (Fabric): 592 Kbits excludes eSRAM ?

iCE40 UP5K
LE (LUT4): 5280
Total RAM: 1024Kbit (no init!) + 120 Kbits
DSP 16x16: 8

Let's look at the links provided for evaluation boards:

iCE40 UltraPlus Breakout Board - this board has an FT2232 for programming but not for UART (channel B has no connections), so we need an extra TTL USB UART adapter, or else we have to emulate the Philosophers messages with Morse code on the LEDs. Not cool. The board has SPI flash that we can use to store the software image to be loaded into the SPRAM, which cannot be initialized from the boot image.

iCE40 UltraPlus MDP provides both Programming and UART connection and SPI Flash.

Upduino V2 includes USB Programming and SPI Flash

iCEvision includes a USB bootloader? "I SELL ON Tindie"? The number of orders since April is 3!? No documentation? No thanks :) Not for me.

Well, no matter which iCE board we use, we would need to implement an SPI bootloader (or UART bootloader) to boot the Zephyr code.

On the Microsemi boards we can use the eNVM for the Zephyr boot code.

How to win the smallest implementation contest

If the grading is really done by the total resources per SoftCPU and performance is not evaluated at all, then we could implement a single-bit serialized core. I have done a partial bit-serial implementation of ARM Cortex-M0 - it for sure reduces resources, and RISC-V could be even simpler in bit-serial mode than ARM Cortex. But when we look at the CSRs that also need to be implemented, the resulting bit-serial RISC-V may not be the smallest implementation. So what options do we have left? Sure - a microcode implementation!
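For readers who have not met the bit-serial idea: instead of a 32-bit wide adder, the datapath has a single 1-bit full adder plus one carry flip-flop and takes 32 cycles per addition. A small Python model of that (the function name is made up, and this is a behavioural sketch rather than HDL):

```python
def bit_serial_add(a, b, width=32):
    """Add two words one bit per 'clock cycle' using a 1-bit full adder."""
    carry = 0      # models the single carry flip-flop
    result = 0
    for i in range(width):                  # one loop iteration = one cycle
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << i
    return result                           # final carry is dropped, as in RV32I ADD


print(hex(bit_serial_add(0x7FFFFFFF, 1)))
print(hex(bit_serial_add(0xFFFFFFFF, 1)))
```

The logic shrinks to almost nothing, at the price of roughly 32 cycles for every 32-bit operation.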

Here we have two options: we can use some existing soft core (maybe tweaked and tuned), or we can create a new, FPGA-architecture-optimized soft core designed to execute microcode that emulates RV32I.

How small can it be? Short answer: it could be damn small. It would be stack based, and use only one block RAM resource and one accumulator (top-of-stack register).
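To give a feel for what "microcode on a stack machine emulating RV32I" means, here is a toy Python sketch. The micro-op names (LOAD, ADD, STORE), the RAM layout and both function names are invented for illustration; a real engine would also keep the stack itself in the block RAM, while here a Python list stands in for it.

```python
MASK32 = 0xFFFFFFFF


def run_microcode(program, ram):
    """Run (op, arg) micro-ops; RV32I registers x0..x31 live at ram[0..31]."""
    stack = []                               # stand-in for the in-RAM stack
    for op, arg in program:
        if op == "LOAD":                     # push a register value
            stack.append(ram[arg])
        elif op == "ADD":                    # pop two, push the 32-bit sum
            top = stack.pop()
            stack.append((stack.pop() + top) & MASK32)
        elif op == "STORE":                  # pop into a register
            value = stack.pop()
            if arg != 0:                     # x0 stays hard-wired to zero
                ram[arg] = value
        else:
            raise ValueError("unknown micro-op: %r" % (op,))
    return ram


def microcode_for_add(rd, rs1, rs2):
    """Hypothetical micro-op sequence emulating RV32I 'add rd, rs1, rs2'."""
    return [("LOAD", rs1), ("LOAD", rs2), ("ADD", None), ("STORE", rd)]


ram = [0] * 32
ram[1], ram[2] = 5, 7
run_microcode(microcode_for_add(3, 1, 2), ram)
print(ram[3])
```

One RV32I instruction turns into several micro-ops, which is why such an engine would be tiny but slow.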

There is one project that I made a long time ago that has a soft core that could for sure emulate RV32I to satisfy the rules of this contest. It was implemented in a Microsemi (then Actel) ProASIC3 device, the A3P060, with the following resources:

Equivalent LE: 700
DSP: none
RAM Bits: 4608 bit (4 block x 512 Byte)
FlashROM bits: 1024

With the A3P060 I implemented a "specialized ASSP" with the following features:
* AVR-like softCPU with an optimized ISA
* FlashROM was used to bootstrap initial code from SPI flash
* Code space was "banked" and I used loadable overlays from SPI flash
* There was some logic for streaming reads from SD card in 4-bit mode
* And there was some other logic (some folks know what...)
* It was programmed using AVR Basic Compiler (product of Silicon Studio aka Antti Lukats)

This SoC system did fit into A3P060 !

I am 100% sure that I could implement RV32I on that SoC (with very minor modifications), even if I needed to load a 512-word AVR code overlay for each RV32I instruction; see, performance does not matter at all. So it would still be a valid RV32I SoftCPU as per the rules.

But even that optimized AVR is too large; the microcode engine that would eventually win the smallest implementation could be much smaller. But then you would need to write a compiler for a stack-based CPU, something I have not yet managed.

So let me make a forecast - the smallest implementation of an RV32I SoftCPU that satisfies the rules could, on both architectures, be as small as:

MicroCode engine for RV32I:
LE: 200 ? Maybe less, depends on the time you spend optimizing the architecture
DSP: 2 ? One for PC increment and one for ADD function
Block RAM: 1 used for Stack and everything else

Both the RISC-V code and the emulation engine microcode would come from external SPI flash, so the microcode storage would not count towards the resources. It would be damn slow, but that is OK for the smallest implementation. It would be smaller than a bit-serial implementation without microcode.

So why am I not doing it with 200 LE if I claim it could be done? Well, that is not a challenge for me; I know I could do it, so the only motivation would be the prize money. And that is just too low. And well, if I submitted a 200 LE version, you could still beat me with a 198 LE implementation, or maybe you would do it with 156 or less? Nonsense. It makes no sense to optimize that low. The rules just do not make sense.

Sorry.

There is another thing - a 200 LE version of RV32I (or any version resource-optimized to the maximum with the goal to win) that satisfies all the rules set for this contest is not meaningful for use in any real-life design. It would be an effort to get the prize money, not to create something that is useful and re-usable.

So let me try to set up some rules that would make sense for the smallest implementation.

NO-NONSENSE RISCV Contest Rules version 1:
  1. RV32I or RV32E
  2. Must run hello world compiled with "official" GNU RISCV GCC
  3. Minimal SoC should fit Lattice XO2-1200 executing code from SPI flash (XiP)

A SoftCPU that satisfies the above rules would be usable in real life projects.
Hint: AVR SoC inside XO2-1200 leaves plenty of resources free for application specific peripherals.

Bit-serial RV32E would be cool. In less than 1000 LUT ? Really Cool.

Now I should shake some prize money ? ;)

Well, bitserial RV32E would really make sense no matter if it fits 1000 LUT or runs on XO2-1200.
