Monday, October 15, 2018

RISCV SoftCPU Contest Part III

RISCV - Getting started with Zephyr

To my big surprise, Zephyr provides build instructions for Windows users; this is really unexpected.

After lots of tweaking over a long evening and the next morning, I am getting closer to compiling something with Zephyr targeting the RISC-V architecture.

Blinky Example (blink an LED with GPIO), errors for the different target boards selected:

BOARD=m2gl025_miv - error gpio drivers are missing
BOARD=zedboard_pulpino - /core/isr.S:447 Error: unrecognized opcode 'eret'
BOARD=hifive1 - /core/fatal.c:198 undefined reference to 'cause_str'
BOARD=qemu_riscv32 - error gpio drivers are missing

So it seems that the LED Blinky example does not work out of the box.

Hello World Example

BOARD=m2gl025_miv - OK, ROM: 10 Kbyte, RAM 4 Kbyte
BOARD=zedboard_pulpino - /core/isr.S:447 Error: unrecognized opcode 'eret'
BOARD=hifive1 - OK, ROM: 14 Kbyte, RAM 4 Kbyte
BOARD=qemu_riscv32 - OK, ROM: 1 Kbyte, RAM 13 Kbyte

Philosophers

BOARD=m2gl025_miv - OK, ROM: 18 Kbyte, RAM 9 Kbyte
BOARD=qemu_riscv32 - OK, ROM: 1 Kbyte, RAM 27 Kbyte

Synchronization 

BOARD=m2gl025_miv - OK, ROM: 11 Kbyte, RAM 6 Kbyte
BOARD=qemu_riscv32 - OK, ROM: 1 Kbyte, RAM 16 Kbyte

What is interesting is that the ROM size for QEMU is always the same, 1052 bytes, while for the real targets it differs. The reason is the linker files: the instructions are not always placed in ROM sections, so we cannot use the default memory map statistics to see how large the instruction memory is.

What we do see is that 32K shared RAM space for instructions and data should be sufficient for all the RTOS examples that are needed to pass the requirements set by the contest rules.
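The 32K estimate can be checked with a few lines of arithmetic. This is only a sketch using the rounded ROM/RAM figures listed above for the real targets; the QEMU numbers are left out because their ROM figure is not meaningful, and the names FOOTPRINTS_KB and SHARED_RAM_KB are made up for this example.

```python
# Footprints (ROM KB, RAM KB) as reported by the builds listed above.
FOOTPRINTS_KB = {
    ("hello_world", "m2gl025_miv"): (10, 4),
    ("hello_world", "hifive1"): (14, 4),
    ("philosophers", "m2gl025_miv"): (18, 9),
    ("synchronization", "m2gl025_miv"): (11, 6),
}

SHARED_RAM_KB = 32  # one memory holding both instructions and data


def combined_kb(rom_kb, ram_kb):
    """Total footprint when code and data share a single RAM."""
    return rom_kb + ram_kb


def fits(budget_kb=SHARED_RAM_KB):
    """Map each build to True if ROM + RAM fits in the shared budget."""
    return {key: combined_kb(*size) <= budget_kb
            for key, size in FOOTPRINTS_KB.items()}


if __name__ == "__main__":
    for (sample, board), ok in fits().items():
        total = combined_kb(*FOOTPRINTS_KB[(sample, board)])
        print(f"{sample:16s} {board:12s} {total:3d} KB  {'fits' if ok else 'too big'}")
```

The largest case is Philosophers on m2gl025_miv with 18 + 9 = 27 Kbyte, which leaves some headroom below 32K.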

Setting up compliance testing

Currently the mainstream RISC-V compliance suite only provides "environments" for instruction set simulators. The forked repo https://github.com/micro-FPGA/riscv-compliance provides some more environments, most notably the "absmin" environment.
 
What is the best way to test the compliance of the SoftCPU? The best way is to design a new Instruction Set Simulator that simulates our SoftCPU, and to validate that simulator first.
And here is a screenshot from this simulator passing the compliance test suite for the LW instruction (compiled with the absmin environment).

Friday, October 12, 2018

RISC-V SoftCPU Contest Part II

Let's look at the contest requirements in closer detail.

The CPU must pass RV32I compliance tests

As a reference, the following github URL is given:

https://github.com/riscv/riscv-compliance/tree/master/riscv-test-suite/rv32i

There is a Makefile and two subdirectories: src and references. The src directory includes the source code for the tests and the references directory holds the reference dumps for the tests. We have no other choice but to assume that we need to pass all the tests, using all the test cases from the src directory. If we now read the riscv compliance suite documentation a bit higher up in the same github repository, we see that it is required to use the reference signatures:

The only requirement needed in this case is that there must be an option to dump the results from the target in the test environment so as the comparison to test reference signature is possible.

So no matter how we validate our SoftCPU, we must provide a way to dump the signature data from memory after the test runs. We should use ALL the sources from the compliance directory, compare the results with the reference signatures, and match in all cases. This is the only way to pass the compliance tests.
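The comparison step itself is simple. The sketch below assumes that both the dumped signature and the reference are text with one hexadecimal word per line; that assumption should be checked against the actual files in the references directory. The function names are made up for this example.

```python
def load_signature(text):
    """Parse a signature dump: one hex word per line, blank lines ignored."""
    return [int(line, 16) for line in text.splitlines() if line.strip()]


def compare_signatures(dumped, reference):
    """Return a list of (index, dumped_word, reference_word) mismatches.

    A word missing on either side is reported as None.
    """
    got = load_signature(dumped)
    want = load_signature(reference)
    mismatches = []
    for i in range(max(len(got), len(want))):
        g = got[i] if i < len(got) else None
        w = want[i] if i < len(want) else None
        if g != w:
            mismatches.append((i, g, w))
    return mismatches


if __name__ == "__main__":
    ref = "00000001\n00000002\n"
    print(compare_signatures("00000001\n00000002\n", ref))  # no mismatches
    print(compare_signatures("00000001\n0000000f\n", ref))  # word 1 differs
```

A test passes only when the returned list is empty, so running this over every test in src gives the "match in all cases" check.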

The good thing is that there is documentation on how to set up the "target" environment for the RISCV compliance tests: https://github.com/riscv/riscv-compliance/tree/master/doc

Documentation is always good to have, right? So what do we have in this document? Let's search for the word "verilog" - found, there is a topic about the use of Verilator. Very good, so we follow the given steps to set up Verilator as the "target" and we are halfway done setting up the compliance suite environment, right?

No. The section for Verilator has one word as its content: "tbd", nothing more! Similarly, there is a section for one existing hardware target with the same word "tbd" as its full content.

So, there is no documentation on how to set up the RISCV compliance test environment for any RTL simulator or any real hardware target.

Cool, eh? So as part of the contest entry we must implement this without guiding documentation and with no example references. The only "targets" included in the official environment are pure instruction set simulators.

What options do we have?

First, we could use Verilator. In that case we should add support for dumping the signature out of memory, which should be doable with some custom C coding. The code would need to figure out where the "dump" region is placed and how large it is, and then write it out in a format compatible with the compliance test reference dumps. Or we write it out in a different format and use some post processing scripts to convert the dumps to the correct format.
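As an illustration of the post processing route, here is a minimal sketch that turns a raw memory image into signature text. It assumes the target format is one 8-digit hexadecimal word per line and that the start and end of the dump region are already known (for example from linker symbols); the function and parameter names are hypothetical.

```python
def memory_to_signature(memory, begin, end):
    """Format memory[begin:end] as one 8-digit hex word per line.

    `memory` is a bytes-like image of the target RAM; `begin` and `end`
    are byte offsets of the signature region and must be word aligned.
    Words are read as little-endian, which is how RISC-V stores them.
    """
    if begin % 4 or end % 4 or not 0 <= begin <= end <= len(memory):
        raise ValueError("signature region must be word aligned and inside memory")
    lines = []
    for offset in range(begin, end, 4):
        word = int.from_bytes(memory[offset:offset + 4], "little")
        lines.append(f"{word:08x}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Three words of RAM; the signature region is the last two.
    ram = bytes.fromhex("ffffffff" "78563412" "01000000")
    print(memory_to_signature(ram, 4, 12), end="")
```

The same function works for a RAM dump captured from an RTL simulator or logged from real hardware, as long as it is first converted to raw bytes.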

We could also use any other RTL simulator as well, in this case we sure would need some post processing scripts. This would also be a valid option as it is not required by the rules that compliance test must be done using verilator, it just has to be done by some means.

We could also validate on a real FPGA. In that case we would need to dump the RAM to the console after the test and log it for later comparison; this would also be a valid compliance test.

Which path we take depends on our experience, skills and mood, I guess. Using the real FPGA would be the most time consuming, as it would require the FPGA board to be loaded with all the test images and all the data to be logged over the serial port; not really fun. So the Verilator or RTL simulator based approaches are faster and require less manual work, as the tests would run in one batch. And I would not envy the judges if they have to use FPGA JTAG programming and UART console logging to verify all the contest entries.

Could we just implement the ASSERT IO Macros and forget the signature dumps? Unfortunately no, the compliance documentation does not allow this method, at least not yet. So if we do not do the signature dumps the contest judges may disqualify our entry as non compliant.

(Must) be possible to be simulated using Verilator

This is a hard requirement by the rules. The rules, however, do not say that we have to simulate the CPU with Verilator or provide any scripts or testbenches for Verilator. As long as we use plain verilog files we should be fine, right? But how would the judges verify our claim that Verilator simulation is possible? Would the judges create the required setup for Verilator and verify our SoftCPU in the time they have for it? The deadline is 23:59 on 26th November and the winners are announced on 3rd of December!

No. There is no way the judges would have time to do that. Or would they? Actually they have to verify the claims for the competition to be fair. It could be that the winning entry does not simulate with Verilator, if the judges did not test for it and there was no documentation or proof in the contest entry either. The wrong entry would then win, and it could be you who loses.

I would say to be safe you should provide some Verilator testbench/script and documentation about it as part of your entry.

But "test coverage" is another thing; it is not mentioned in the rules at all. So it would be perfectly legal to wire the instruction memory read bus to 0x00000063 (the bytes 63 00 00 00, a branch to itself) and start Verilator with the CPU core. It would be a simulation, fairly minimal but still a valid simulation.

There is however another problem - to win the contest we pretty much are forced to use FPGA architecture hard IP primitives directly using vendor libraries or vendor IP Core generator. For those hard IP blocks we do not have verilog simulation code. So in order to make the SoftCPU Verilator friendly we need to provide pure verilog simulation code to be used in place of those hard IP blocks.

Which brings us to the next problem: if we validate in simulation with Verilator or an RTL simulator, we are forced to use a verilog-only version of our SoftCPU (one that replaces the hard macro IP blocks with verilog). If that simulation-only code works differently from the real hard IP blocks, then our validation is invalid: the CPU would pass compliance in simulation but not in the FPGA.

So to be really really safe we should provide compliance test on real FPGA because Verilator would not use the same code base as FPGA tech optimized code for the SoC implementation.

You say we SHOULD write in FPGA vendor neutral verilog? This is just not possible. One example would be the Microsemi targets: there we sure would need to use eNVM and/or eSRAM for ROM/RAM storage, but those resources are only accessible via Libero SmartDesign and are exposed as a black box with an AHBlite interface, with no BSD licensed simulation verilog available.

Sure, if we provide an AHBlite RAM model for Verilator and use the SmartDesign based eSRAM hard IP block in the FPGA design, the judges would not disqualify us if we only provide compliance testing in simulation.

But if we use, say, Math/DSP blocks in an "enhanced" way in Microsemi and/or Lattice iCE+ designs, the issue is way more complex. What if the "verilog" model we provide to "mimic" the FPGA vendor's hard IP block is not correct?

Conclusion: to be safe we should run all compliance tests in simulation (Verilator preferred) and also in the real hardware (at least if we use vendor IP blocks directly).

Dhrystone

From the rules: performance will be measured with the Dhrystone benchmark (from riscv github!) compiled with -O3 -fno_inline option. We should assume that we must run those source files from the referenced github location without modifications, right?

The main C file dhrystone_main.c prints out following as result:

printf("Microseconds for one run through Dhrystone: %ld\n", Microseconds);
printf("Dhrystones per Second: %ld\n", Dhrystones_Per_Second);

So we do get a single metric, Dhrystones per Second, as the result, and we must assume that this is the only result used in the performance scoring. It is not Dhrystones/MHz, it is Dhrystones per Second. This means that the contest is not for a SoftCPU but for an FPGA SoC implementation, as the Dhrystone result is highly dependent on the performance of the memory subsystem in the FPGA and on the maximum reachable CPU clock frequency.
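To make the difference between the two metrics concrete, here is a small sketch. The two designs and their numbers are invented purely for illustration; the constant 1757 is the conventional Dhrystones per second figure taken as 1 DMIPS.

```python
VAX_DHRYSTONES_PER_SECOND = 1757  # conventional reference for 1 DMIPS


def dhrystones_per_second(runs, elapsed_us):
    """Raw benchmark result: number of runs completed per second."""
    return runs * 1_000_000 / elapsed_us


def dmips_per_mhz(dps, clock_mhz):
    """Clock-normalised figure, which is NOT what the contest scores."""
    return dps / VAX_DHRYSTONES_PER_SECOND / clock_mhz


if __name__ == "__main__":
    # Hypothetical design A: simple core that reaches a high clock.
    a = dhrystones_per_second(runs=100_000, elapsed_us=1_000_000)
    # Hypothetical design B: more efficient core at half the clock.
    b = dhrystones_per_second(runs=70_000, elapsed_us=1_000_000)
    print(f"A: {a:.0f} Dhrystones/s, {dmips_per_mhz(a, 100):.2f} DMIPS/MHz")
    print(f"B: {b:.0f} Dhrystones/s, {dmips_per_mhz(b, 50):.2f} DMIPS/MHz")
```

In this made-up comparison design B is the better CPU per MHz, but design A wins the category because only Dhrystones per Second counts.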

So what we need to optimize for speed are:
  1. SoftCPU performance tuned for Dhrystone benchmark only
  2. Memory subsystem performance
  3. Bus structure performance
  4. FPGA timings tuning to reach higher clock for our SoC design
What about overclocking? How much overclocking is allowed? We could even say that we need to increase the IGLOO2 core voltage to 1.25V and cool it to -40C; this would improve FPGA timings a lot. OK, let's forget overclocking (although it would not actually violate the rules).

A good read on Dhrystone is the EEMBC whitepaper about it. Most of the ways to fake the results cannot be used in this contest. But wait: the Zephyr GCC is required for the Zephyr RTOS, but the rules do not actually say which compiler should be used for the Dhrystone test. So, to be very technical, we could use an "optimized" compiler to improve our Dhrystone result. I guess we would get disqualified, but by the rules it would be valid.

What about the -DREG option? The rules say nothing about it. There are actually many more parts of the benchmark build that are not clear. In the official riscv github, the command line for the performance tests is:
-DPREALLOCATE=1 -mcmodel=medany -static -std=gnu99 -O2 -ffast-math -fno-common -fno-builtin-printf
The contest requires -O3, so we must assume that the Makefile from the official riscv benchmark repository should not be used. What about the other files from the official repository? The Dhrystone C file that we assume we MUST use includes "util.h". This file is located in
riscv-tests/benchmarks/common
Are we required to use this util.h file? Or can we modify the benchmark source C code? This include file also pulls in encode.h and declares:

extern void setStats(int enable);

This function is defined in /benchmarks/common/syscalls.c - if we look at that file it is clearly made to be used only in instruction set simulators not in RTL simulation or in FPGA benchmarking.

So what can we do? Should we modify the benchmark source code? Better not, but then we would need to provide some other "util.h" replacing the one from riscv github with our own.

Time - we need a real time timer as well, or the benchmarking would not make sense. But hey, we could accidentally have the timer run at a clock that is, say, 5% off? It could improve our score by about 5%?
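The effect of a mis-clocked timer is easy to quantify. This sketch assumes the score is computed as runs divided by measured time, so a timer that ticks too slowly shortens the measured time and inflates the result; the function name and the numbers are hypothetical.

```python
def reported_score(true_dps, timer_rate):
    """Score reported when the timer ticks at `timer_rate` times real time.

    timer_rate = 1.0 is an exact timer; 0.95 means the timer runs 5% slow,
    so the measured run time is 5% shorter than the real one.
    """
    if timer_rate <= 0:
        raise ValueError("timer_rate must be positive")
    return true_dps / timer_rate


if __name__ == "__main__":
    honest = reported_score(1000.0, 1.0)
    slow_timer = reported_score(1000.0, 0.95)
    gain = (slow_timer / honest - 1) * 100
    print(f"honest: {honest:.1f}, slow timer: {slow_timer:.1f} (+{gain:.2f}%)")
```

A timer running 5% slow turns into a gain of 1/0.95, which is slightly more than 5%.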

What about the "Smallest" category, do we have to provide Dhrystone capability or not? Dhrystone requires some sort of real time clock, for the smallest category we could omit that as Dhrystone would not be used to score it. But maybe it is still required to have possibility to run Dhrystone tests on the entry that only targets the "smallest" category? Not clear. To be safe we should make sure we can run Dhrystone test even if we clearly target only the "smallest" category.

GETTING MAD, 32 bits at a time..

To understand the story of Dhrystone for RISC-V in the context of the RISC-V SoftCPU contest, I tried it out. Here it goes. Toolchain? Let's take the official one, and a pre-compiled one to be sure that it is correctly configured and compiled, so from here:

https://gnu-mcu-eclipse.github.io/toolchain/riscv/

Here it clearly says that this page provides the correct multi-lib toolchain for embedded (non linux) targets. Absolutely everything says this must be the correct toolchain to use.

The contest targets RV32I, so I set up a build script for Dhrystone using -march=rv32i/-mabi=ilp32. I am using unmodified files from the riscv github, and this is how far I get:

undefined reference to __umoddi3
undefined reference to __mulsi3
undefined reference to __divsi3

The errors come from syscalls.c, from dhrystone.c and from dhrystone_main.c: from all included C files!

A quick google search says that these errors happen when targeting 32 bit RISC-V with a toolchain that is incorrectly configured, with the multilib option not enabled. What? The very web page where I got the toolchain says it is a "multilib" toolchain. Does it mean multilib in some other context?

What about -march=rv32im ?

Wah - the errors from dhrystone.c and dhrystone_main.c disappeared, and only a few errors from syscalls.c remained!

Now this is important - this simple test clearly shows that if we are competing in the highest performance category we must implement RV32IM. This is not an option, it is pretty much a requirement (well, on the Microsemi platform at least).
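The reason is that plain RV32I has no multiply instruction, so every multiplication becomes a call to a helper such as __mulsi3. The sketch below shows the general shift-and-add idea behind such a helper; it is an illustration only and not the actual library code.

```python
MASK32 = 0xFFFFFFFF


def mul32_shift_add(a, b):
    """Multiply two 32-bit values using only shift, add and bit test.

    Returns (low 32 bits of the product, loop iterations used).
    """
    a &= MASK32
    b &= MASK32
    result = 0
    iterations = 0
    while b:
        if b & 1:
            result = (result + a) & MASK32  # add the shifted multiplicand
        a = (a << 1) & MASK32
        b >>= 1
        iterations += 1
    return result, iterations


if __name__ == "__main__":
    print(mul32_shift_add(6, 7))
    print(mul32_shift_add(0xFFFFFFFF, 0xFFFFFFFF))
```

Each loop iteration costs several RV32I instructions and a multiply can need up to 32 iterations, while a core with the M extension does the same work in a single instruction.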

Dhrystone uses the strcmp function once, and it is implemented in syscalls. If we manage to optimize it even a little we gain some benefit, or we could just return the correct result without performing the function, which would be faking of course. But if we do not optimize strcmp, maybe our competitor does and uses that performance boost to win? The status of the syscalls file is not really clear. I would assume that it is OK to modify it; take these, for example:

extern volatile uint64_t tohost;
extern volatile uint64_t fromhost;

They are in syscalls to TALK to the instruction set simulator. We are, however, not running the tests in an instruction set simulator, so we pretty much have to modify syscalls to match our embedded FPGA SoC. I assume we can do this without violating the contest rules. OTOH it may also be possible to modify the SoftCPU FPGA SoC in such a way that the syscalls from the riscv github could be used without modification.

Are you confused? I truly am.

to be continued...

Wednesday, October 10, 2018

RISC-V SoftCPU Contest

This contest was initially launched at ORCONF 2018 in Gdansk and is officially now hosted at riscv.org - RISC-V SoftCPU Contest.

Big question, what is evaluated in the contest: a SoftCPU core or SoC based on the SoftCPU core?
From the rules:

The entries will be RV32I-compliant soft CPU's.

But it is also clear that the SoftCPU core itself can not pass the minimum requirements as it has to be a complete FPGA implementation and it must run Philosophers and Synchronization examples of Zephyr RTOS. So to pass the requirements we must have a minimal SoC that uses our SoftCPU while for the resource utilization we should only count the resources used by the SoftCPU right? The SoC system bus and peripherals components are for sure not part of the CPU.

Requirements:

  1. RV32I-Compliant - exact documents not specified!
  2. For performance category Dhrystone is used
  3. Must run Zephyr 1.13 version keeping RTOS core (not specified) untouched
  4. Must pass (assumed ALL) RV32I compliance tests
  5. Must boot Philosophers and Synchronization (but they may fail?)
  6. Complete FPGA design for IGLOO2, SmartFusion2 or iCE40 UltraPlus
  7. Must use unmodified GCC toolchain provided by Zephyr
  8. Must use verilog and must be possible to simulate with verilator
  9. Must include binary version of the bitstream and instruction to build it
Let's look at the requirements. From the RISC-V website we can get two ISMs (Instruction Set Manuals): one for the user-level ISA, version 2.2, and one for the privileged ISA, version 1.10. We must assume that compliance for RV32I can be figured out by studying those documents.

There is no way those documents describe the RV32I requirements cleanly. First, there are separate user and privileged ISM documents. We absolutely have to follow the user ISM; this is certain. But what about the privileged ISM document? If we want to (and we have to) pass the "compliance testing" in full, we must implement part of the privileged stuff; there is no way around it. But it is not defined which parts, so what can we do?

We look at the "compliance test" suite, list all the features that are exercised there, and implement them using both ISM documents.

Some examples: the privileged ISM describes the WFI instruction, which is not present in the user ISM. As this instruction is not tested by the compliance suite, we do not need to implement it. But MRET from the privileged ISM we do need to implement, as it is used in the compliance suite.

It is an equally bad story with the CSRs: the user ISM describes three 64 bit CSRs as part of RV32I, but the compliance suite uses many more, and many more still are described in the privileged ISM. Which ones do we need to implement to be RV32I compliant?

There is no definitive answer to those issues.

Let's take the mtval CSR. It is described in the privileged ISM and is also partially tested in the compliance suite, but referenced there by the old name mbadaddr from an outdated ISM document. So mtval should contain the instruction CODE on an illegal instruction trap, but as this is not checked, we do not need to implement this, right? We need to implement mtval only to the extent that the compliance suite tests it.

Another issue: what about register widths? In some cases some registers would always contain zeros in the leading bits on a given system; can we hard wire them to 0? As an example, we can limit the external memory region to only 1 Mbyte; on a small SoC in a small FPGA all the available memory and peripherals would fit into that memory space. In this case the program counter would not need the full 32 bits. The same goes for some other registers. As those bits would always be 0, could they be implemented as constant 0? Or would that be considered a violation of the ISM document?

Very very complicated - the requirements "RV32I" are not clear at all.

The user ISM document says that three 64 bit CSRs are mandatory (counters); however, for low end implementations the ISM allows the upper 32 bits to be implemented in software! But no matter whether 32 or 64 bits, we need 3 counters that are read only. Counters can be implemented using DSP blocks, which would save plenty of FPGA logic elements. So if we want to minimize logic use, we should use the FPGA architecture's hard DSP IP blocks for those counters. Because if we do not do this, our competitor would. It also seems that we can combine the RDCYCLE and RDTIME counters into one counter; there is nothing in any of the documents that says it is not valid to do so.

Ok lets try to figure out the minimum CSR needed to satisfy the contest rules.

First - all unimplemented CSRs must be readable and return 0 (like misa and many more).
mtvec - can be hard wired and read only; for optimization a vector with the fewest 1's would be best
mscratch - needed, used in the compliance test
mtval - needed, used in the compliance test (but only partial functions are tested..)
mepc - needed (lower 2 bits should be 0)
mcause - needed, but the number of exception codes supported is not clear
mip/mie - not clear if needed, as interrupt handling is not mandatory
mstatus - not clear if needed; the zephyr core does save/restore it, but does not use it directly
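The list above can be written down as a tiny behavioural model. This is a sketch of my reading of the list, not a statement of what the contest requires: the "not clear" registers (mip, mie, mstatus) are simply treated as unimplemented, and the class name and default trap vector are made up. The CSR addresses are the machine-mode addresses from the privileged ISM.

```python
# Machine-mode CSR addresses from the privileged specification.
CSR = {
    "mstatus": 0x300, "misa": 0x301, "mie": 0x304, "mtvec": 0x305,
    "mscratch": 0x340, "mepc": 0x341, "mcause": 0x342, "mtval": 0x343,
    "mip": 0x344,
}

MASK32 = 0xFFFFFFFF


class MinimalCsrFile:
    """CSR file following the list above; everything else reads as 0."""

    def __init__(self, mtvec=0x00000000):
        self.mtvec = mtvec & MASK32  # hard-wired trap vector (hypothetical value)
        self.regs = {CSR["mscratch"]: 0, CSR["mepc"]: 0,
                     CSR["mcause"]: 0, CSR["mtval"]: 0}

    def read(self, addr):
        if addr == CSR["mtvec"]:
            return self.mtvec
        return self.regs.get(addr, 0)  # unimplemented CSRs read as 0

    def write(self, addr, value):
        value &= MASK32
        if addr == CSR["mepc"]:
            value &= ~0x3 & MASK32  # lower 2 bits of mepc stay 0
        if addr in self.regs:
            self.regs[addr] = value
        # writes to mtvec and to unimplemented CSRs are silently ignored


if __name__ == "__main__":
    csr = MinimalCsrFile(mtvec=0x100)
    csr.write(CSR["mepc"], 0x1237)
    print(hex(csr.read(CSR["mepc"])), hex(csr.read(CSR["misa"])))
```

Only four registers need storage in this model, which is the point of the exercise: everything else is a constant.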

For minimal system supporting simplest RTOS we need at least a "tick" interrupt, this however could be implemented as external "tick" triggering NMI with special logic to be enabled (and disabled on power on reset) - this would allow us to exclude all interrupt related logic from the RV32I core.

OK, this is now really complicated, from the rules: zephyr "core" should not be modified, it is however not specified what is considered as part of zephyr core?

Maybe it refers to the riscv32 core arch?
https://github.com/zephyrproject-rtos/zephyr/tree/master/arch/riscv32/core
Or does it refer to some undefined set of files from undefined selected directories?

In any case it is not clearly visible what is the minimal required support for the systick/timer for the zephyr (in the way that "core" is not modified).

In short: the requirements for the RV32I are not clear.

Now let's look at the grading for the smallest implementation, FPGA resource utilization. How is this calculated?

There is no weighting of DSP and RAM vs LE (logic elements). Smallest number of total resources!? This can only be read as: every resource counts as 1 towards the total count. So each DSP and RAM instance has the same weight as a logic element. Now, logic elements include a LUT and a flip-flop: sometimes only the logic is used, sometimes only the flip-flop, and sometimes both. The scoring does not account for that, so if more of our logic elements use both the LUT and the flip-flop, we win..
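Read literally, the score is just an unweighted sum. The sketch below applies that reading to two sets of figures quoted later in this post (the A3P060 design and the microcode engine forecast); the function name is made up and the scoring formula itself is my interpretation of the rules, not something the rules spell out.

```python
def total_resources(le, dsp, ram_blocks):
    """Unweighted sum: each LE, DSP block and RAM block counts as 1."""
    return le + dsp + ram_blocks


if __name__ == "__main__":
    # Figures taken from the designs discussed further down in this post.
    a3p060_soc = total_resources(le=700, dsp=0, ram_blocks=4)
    microcode_forecast = total_resources(le=200, dsp=2, ram_blocks=1)
    print("A3P060 SoC:", a3p060_soc)
    print("Microcode engine forecast:", microcode_forecast)
```

Under this reading, moving an adder that costs dozens of logic elements into a single DSP block always lowers the score, which is why the math blocks become so attractive.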

Based on the above to win in the smallest category, we should:
  1. Implement all adders and counters using math blocks.
  2. Implement register file using minimal number of RAM blocks
  3. Push as much as possible from the SoftCPU to the SoC subsystem
And now let's make it more complicated. The rules say that on the SmartFusion2 we should not use hard CPU subsystems as part of the SoftCPU, but hey, there is 256K of eNVM in IGLOO2; this resource is not part of a hard CPU and it is not a logic element, RAM or math block. So if we use that resource as ROM lookup or microcode storage, it would not count towards the resource utilization. Interesting, huh? Also interesting is the note that "interesting ways" to enhance the design using the hard CPU subsystem can be implemented. Do we get special points for this too? No idea, possibly not.

Hmm.. "Hard CPU subsystem should not be used such" - but what about eSRAM? This gets things really complicated. Let's say we implement the SoftCPU on IGLOO2 and use the eSRAM as the register file. This is, well, kinda stupid, but it would not violate the rules! On IGLOO2 the eSRAM is counted as part of the FPGA and, if used, would increase the utilization of the SoftCPU. But let's move the same design to SmartFusion2: now the same eSRAM resource is part of the MSS and not counted as an FPGA RAM resource. So what now: is our design against the rules, or should we "not count" the eSRAM as it is not part of the FPGA fabric? Complicated.

Again, short: the way of calculating the "resources" is not clear at all.

TIP 1: expose as low level and simple an interface to the external world as possible, and push all the "bus adapter" and bus multiplexer code out of the SoftCPU core. This external bus should also be as narrow as possible, say only 1 Mbyte in total; this saves some logic too.

Target Devices with resources:

IGLOO2 M2GL025
LE (4LUT+FF): 27696
DSP 18x18: 34
eNVM: 256KB
Total RAM: 1104 Kbits includes eSRAM

SmartFusion2 M2S025
LE (4LUT+FF): 27696
DSP 18x18: 34
eNVM: 256KB - part of MSS
Total RAM (Fabric): 592 Kbits excludes eSRAM ?

iCE40 UP5K
LE (LUT4): 5280
Total RAM: 1024Kbit (no init!) + 120 Kbits
DSP 16x16: 8

Lets look at the links provided for evaluation boards:

iCE40 UltraPlus Breakout Board - this board has an FT2232 for programming but not for UART (channel B has no connections), so we need an extra TTL USB UART adapter, or else we emulate the Philosopher messages with morse code on the LEDs. Not cool. The board has SPI flash that we can use to store the software image to be loaded into the SPRAM, which can not be initialized from the boot image.

iCE40 UltraPlus MDP provides both Programming and UART connection and SPI Flash.

Upduino V2 includes USB Programming and SPI Flash

iCEvision includes USB bootloader? I SELL ON tindie ? Number of orders since April is 3!? No documentation? No thanks :) not for me.

Well no matter what iCE board we would use, we would need to implement the SPI bootloader (or UART bootloader) to boot the zephyr code.

On the Microsemi boards we can use the eNVM for the zephyr boot code.

How to win the smallest implementation contest

If the grading is really done by the total resources per SoftCPU and performance is not evaluated at all, then we could implement a single bit serialized core. I have done a partial bit-serial implementation of the ARM Cortex M0, and it for sure reduces resources; RISC-V could be even simpler in bit-serial mode than ARM Cortex. But when we look at the CSRs that also need to be implemented, then the resulting bit-serial RISC-V may not be the smallest implementation. So what options do we have left? Sure - a microcode implementation!
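For readers who have not seen a bit-serial datapath, here is a sketch of the idea for a 32-bit ADD: a single full adder and one carry flip-flop are reused for 32 cycles instead of building a 32-bit wide adder. It is a behavioural model for illustration only, not a description of any actual core.

```python
def bit_serial_add(a, b, width=32):
    """Add two `width`-bit values one bit per cycle.

    Returns (sum modulo 2**width, carry out, cycles used).
    """
    carry = 0   # models the single carry flip-flop
    result = 0
    for cycle in range(width):
        bit_a = (a >> cycle) & 1
        bit_b = (b >> cycle) & 1
        sum_bit = bit_a ^ bit_b ^ carry                       # full-adder sum
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))   # full-adder carry
        result |= sum_bit << cycle
    return result, carry, width


if __name__ == "__main__":
    print(bit_serial_add(5, 7))
    print(bit_serial_add(0xFFFFFFFF, 1))
```

The logic cost is roughly one bit of adder instead of 32, paid for with 32 cycles per operation, which is acceptable only when performance is not scored.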

Here we have two options, we can use some existing soft Core (maybe tweaked and tuned) or we may create a new FPGA architecture optimized softCore designed to execute some microcode that emulates RV32I.

How small can it be? Short answer: it could be damn small. It would be stack based and use only one block RAM resource and one accumulator (top of stack register).

There is one project that I made a long time ago that has a softCore that could for sure emulate RV32I to satisfy the rules of this contest. It was implemented in a Microsemi (then Actel) ProASIC3 device, the A3P060, with the following resources:

Equivalent LE: 700
DSP: none
RAM Bits: 4608 bit (4 block x 512 Byte)
FlashROM bits: 1024

With A3P060 I implemented a "specialized ASSP" with following features:
* AVR like softCPU optimized ISA
* FlashROM was used to bootstrap initial code from SPI flash
* Code space was "banked" and I used loadable overlays from SPI flash
* There was some logic for streaming reads from SD card in 4 bit mode
* And there was some other logic (some folks know what...)
* It was programmed using AVR Basic Compiler (product of Silicon Studio aka Antti Lukats)

This SoC system did fit into A3P060 !

I am 100% sure that I could implement RV32I on that SoC (with very minor modifications), even if I need to load a 512 word AVR code overlay for each RV32I instruction; performance does not matter at all, you see. So it would still be a valid RV32I SoftCPU according to the rules.

But even that optimized AVR is too large; the microcode engine that would eventually win the smallest implementation could be much smaller. But then you would need to write a compiler for a stack based CPU, something I have not yet managed.

So let me make a forecast - the smallest implementation of an RV32I SoftCPU that satisfies the rules could, on both architectures, be as small as:

MicroCode engine for RV32I:
LE: 200 ? Maybe less, depends the time you spend to optimize the architecture
DSP: 2 ? One for PC increment and one for ADD function
Block RAM: 1 used for Stack and everything else

Both the RISC-V code and the emulation engine microcode would come from external SPI flash, so the microcode storage would not count towards the resources. It would be damn slow, but that is OK for the smallest implementation. It would be smaller than a bit-serial implementation without microcode.

So why am I not doing it with 200 LE if I claim it could be done? Well, that is not a challenge for me; I know I could do it, so the only motivation would be the prize money. And that is just too low. And if I submitted a 200 LE version, you could still beat me with a 198 LE implementation, or maybe you would do it with 156 or less. Nonsense. It makes no sense to optimize that low. The rules just do not make sense.

Sorry.

There is another thing - a 200 LE version (or maximum resource optimized with the goal to win) of RV32I that satisfies all the rules set for this contest is not meaningful to be used in any real life design. It would be an effort to get the prize money, not to create something that is useful and re-usable.

So let me try to set up some rules that would make sense for the smallest implementation.

NO-NONSENSE RISCV Contest Rules version 1:
  1. RV32I or RV32E
  2. Must run hello world compiled with "official" GNU RISCV GCC
  3. Minimal SoC should fit Lattice XO2-1200 executing code from SPI flash (XiP)
A SoftCPU that satisfies the above rules would be usable in real life projects. 
Hint: AVR SoC inside XO2-1200 leaves plenty of resources free for application specific peripherals. 

Bit-serial RV32E would be cool. In less than 1000 LUT ? Really Cool.

Now I should shake some prize money ? ;)

Well, bitserial RV32E would really make sense no matter if it fits 1000 LUT or runs on XO2-1200.

Tuesday, October 2, 2018

RISC-V Take I

RISC-V explained in two words: "Frozen ISA" - indeed this is all there is: a RISC instruction set that is fixed and guarded by a foundation. Products designed to the fixed RISC-V ISA should remain compatible and working as long as they adhere to the ISA; that is the whole point, nothing more.

There is however no known good implementation of RISC-V, there are many implementations but none of them is the golden reference.

picorv32 is a nice and simple implementation written in verilog, with a simple adapter to provide AXI bus support. It took maybe an hour to convert it to a Vivado IP Catalog IP core; this is how it looks in IP Integrator:

Now it takes just a few mouse clicks to create some simplest RISC-V SoC system:


For initial testing I just added AXI BRAM and then used Vivado automation to connect the bus infra for me.

Next is testing right? Now we need some C compiler or assembler at least for RISC-V, this should be simple right? Well no - it is not that easy to find windows executables for RISC-V. Well well, there is F32C project that does include those binaries, well binaries are actually hosted by FPGArduino web.

I have implemented F32C in FPGA before, I have used the IDE and compiled C code for f32c using their compiler. So it must work. It must produce valid and working code? Right? The compiled programs do work, I know they do.

Next I created the smallest possible test program (in assembler) to compile a few instructions for the picorv32 system. This is what the f32c/fpgarduino riscv compiler emitted in the listing file:

   9              loop1:
  10 0010 63080000 beq zero,zero,loop1
  11 0014 6F000001 j loop1
  12 0018 6F000001 j loop1
  13             
  14              loop2:
  15 001c 630E0000 beq zero,zero,loop2
  16 0020 6F00C001 j loop2
  17 0024 6F000001 j loop1

What!? This can not be! Can it be that risc-v has NO relative jumps or branches !?
Look at the code generated, both jump and branch instruction addressing is absolute.
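To check this reading of the listing, the immediate fields can be decoded by hand or with a few lines of code. The sketch below follows the B-type and J-type immediate layouts from the RISC-V user ISM and assumes that the listing prints instruction bytes in memory (little-endian) order; the function names are made up. It only decodes the fields as printed in the listing and makes no claim about what a linked binary would contain.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def listing_word(hex_bytes):
    """Convert listing bytes such as '63080000' to a 32-bit word.

    Assumes the listing prints bytes in memory order (little-endian).
    """
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def b_type_imm(word):
    """Immediate field of a branch (B-type) instruction."""
    imm = (((word >> 31) & 0x1) << 12 | ((word >> 7) & 0x1) << 11 |
           ((word >> 25) & 0x3F) << 5 | ((word >> 8) & 0xF) << 1)
    return sign_extend(imm, 13)


def j_type_imm(word):
    """Immediate field of a JAL (J-type) instruction."""
    imm = (((word >> 31) & 0x1) << 20 | ((word >> 12) & 0xFF) << 12 |
           ((word >> 20) & 0x1) << 11 | ((word >> 21) & 0x3FF) << 1)
    return sign_extend(imm, 21)


if __name__ == "__main__":
    print(hex(b_type_imm(listing_word("63080000"))))  # beq at 0x10 -> loop1
    print(hex(b_type_imm(listing_word("630E0000"))))  # beq at 0x1c -> loop2
    print(hex(j_type_imm(listing_word("6F000001"))))  # j loop1
    print(hex(j_type_imm(listing_word("6F00C001"))))  # j loop2
```

The decoded fields come out as 0x10 and 0x1c, which are the addresses of loop1 and loop2 in the listing. A branch to its own location with relative addressing would carry an offset of 0 instead.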

This can not be! Looking at RISC-V specification. It clearly says that both branches and jumps use relative addressing. I am looking at the listing again.. absolute addresses? But the code generated by this compiler works in FPGA, I know it does. It really does. Well if executed by f32c implementation of RISC-V...

I just can't believe it. Did the f32c developers really change the relative addressing to absolute addressing? And then patch GCC and provide binaries of this patched riscv toolchain for everybody to download? And wonder why nothing works?

I just can't describe how I felt when I realized that the f32c/fpgarduino developers are indeed using an absolute addressing compiler and a broken softcore. I was not happy at all.

Next try, more search.. and found, gnu-mcu-eclipse toolchain binaries for all host OS :) !
It takes some reading to find different commandline switches needed:

riscv-none-embed-gcc.exe -c -march=rv32i -mabi=ilp32 -Wa,-adhln -g start.S >start.lst

OK, in the listing the relative branch to its own location (forever loop) shows as the bytes 63 00 00 00 and NOP as 13 00 00 00, that is, the little-endian words 0x00000063 and 0x00000013. This is all I wanted to know for starters.

Double click on the BRAM and then in COE file editor init values can be entered:

NOP
NOP
L1: beq zero,zero, L1
NOP
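Instead of typing the values into the COE editor, the same four-word program can be generated as COE text. This is a sketch assuming a 32-bit wide BRAM, the standard two-keyword COE layout (radix plus vector), and listing bytes printed in memory order; the helper names are made up.

```python
def listing_bytes_to_word(hex_bytes):
    """Convert listing bytes (memory order) to a 32-bit little-endian word."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def make_coe(words):
    """Render 32-bit words as COE text with a hexadecimal radix."""
    body = ",\n".join(f"{w:08x}" for w in words)
    return ("memory_initialization_radix=16;\n"
            "memory_initialization_vector=\n" + body + ";\n")


NOP = listing_bytes_to_word("13000000")       # addi x0, x0, 0
BEQ_SELF = listing_bytes_to_word("63000000")  # beq x0, x0, 0 (forever loop)

# Same program as above: two NOPs, the loop at address 8, one more NOP.
PROGRAM = [NOP, NOP, BEQ_SELF, NOP]

if __name__ == "__main__":
    print(make_coe(PROGRAM), end="")
```

With this layout the branch sits at word index 2, which is byte address 8 for a 32-bit wide memory.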

Now it is exciting, does picorv32 execute the branch correctly?


And it is working - the program counter goes from 0 to 4, then to 8 where the branch instruction is. The next location is also fetched, then execution continues again from location 8: the forever loop is working.

It is clearly visible in the trace that there is pretty large latency fetching the instructions from BRAM over AXI bus. Sure this is far away from optimal, but at least it works, and it is now instantly possible to create Vivado design with picorv32 that can use any of the AXI IP cores available for Vivado.
For faster BRAM access it would possibly make sense to split the picorv32 bus address space between AXI and LMB and put the BRAM block on the LMB bus. This is also not complicated, but right now it is not of primary importance.