Monday, December 24, 2018

GRVI de-mystified - Part I


This is an exploration of the GRVI Phalanx microarchitecture based on publicly available presentations and papers. Some claims from Gray Research about the GRVI microarchitecture:

  1. Datapath: 250 LUT (LUT6 or total LUT?)
  2. PE complete: 320 LUT at 375 MHz
  3. PE + share of cluster: ~480 LUT
  4. 4000 LUT per cluster
  5. 3 pipeline stages
  6. 2 cycle load/3 cycle taken branches/jumps
  7. Two pairs of operand multiplexers
  8. ALU, PC unit and comparator use carry logic
  9. RPM optimized to almost max?
  10. Datapath diagram and floorplan:


GRVI Datapath RPM: 35 CLB (280 LUT6). We should assume that most (or all?) of the datapath must be constrained into this region. Red marks the 5 CLB of the register file; green is the ALU, where the carry chain blocks are clearly visible. Visible utilization:
  • Carry chains: 32 bit (ALU, green), plus two 24-bit chains (what!?)
  • LUTRAM: 40 LUT (5 CLB) dual port register file (red)
  • Total used LUT6 (visible): 229
  • Total used FF (visible) 145
Due to the picture resolution the count may not be fully accurate. Also, only the LUT6 count is given, and many LUT6 are used as dual LUT5: at least 140 of them. If we count LUTs per used LUT output (not yet counting the LUTRAM as dual), we get about 370 LUT for the datapath; if we also count each LUTRAM as a dual LUT, the count is about 410 LUT for the datapath alone, well above the advertised ~250 LUT. If we count only LUT6, about 20 LUT of the advertised ~250 are missing from the RPM plan view.
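
The counting above can be redone as a small calculation. This is only a sketch of one reading of the numbers: it assumes the 229 visible LUT6 already include the 40 LUTRAM sites, and that every dual-LUT5 site contributes one extra LUT output. The function and constant names are made up for illustration.

```python
# Figures read off the RPM floorplan picture (approximate, see the text).
VISIBLE_LUT6 = 229   # all visible LUT6 sites (assumed to include the LUTRAM)
DUAL_LUT5 = 140      # LUT6 sites used as two LUT5 (lower bound)
LUTRAM_LUT6 = 40     # register file LUTRAM sites (5 CLB)

def lut_outputs(lut6_sites, dual_sites, lutram_sites, lutram_as_dual):
    """Count LUTs per used output: each dual-LUT5 site adds one extra LUT."""
    total = lut6_sites + dual_sites
    if lutram_as_dual:
        total += lutram_sites  # each LUTRAM site counted twice as well
    return total

print(lut_outputs(VISIBLE_LUT6, DUAL_LUT5, LUTRAM_LUT6, False))  # 369
print(lut_outputs(VISIBLE_LUT6, DUAL_LUT5, LUTRAM_LUT6, True))   # 409
```

Both results round to the ~370 and ~410 quoted above, against the advertised ~250 LUT.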



GRVI Datapath from official presentation slides and documents.


GRVI Datapath with added details. Two pairs (2 x 2 = 4?) of operand multiplexers cannot be right, as there is really no need to multiplex both inputs to the compare unit. It is much more likely that the multiplexers are arranged as shown in the detailed datapath; those multiplexers and four 32-bit registers still fit into 8 CLB.
RPM block with comments about guessed function block locations - something must be wrong. If we assume that the datapath from the diagram is fitted into the RPM map, and that the visible resource utilization is correct, there are still some functions that cannot be located: the selector for the immediate operand needs at least 19 LUT6 (2.5 CLB), and it is just not there. The result multiplexer cannot be a full 5:1 either. There is a trick to avoid that, but for it to work the ALU (or the DIN multiplexer in the shared area) must be able to return zero as its result; that would allow the 5:1 multiplexer to be used only for the lowest ALU bit.

List (not complete) of minimal resources needed for GRVI PE (assuming 4K IMEM):
  1. 40 LUT6: Register file (5 CLB, minimal 37 LUT6?)
  2. 32 LUT6 + Carry: ALU (4 CLB)
  3. 32 LUT6 + ??: Compare unit
  4. 19 LUT6: immediate operand generator (2.5 CLB)
  5. 5 LUT6+10 FF: next_pc + if_pc (1 CLB)
  6. 10 LUT6 + 10FF: pc_incr + dc_pc (1.5 CLB)
  7. 32 LUT6: result mux (4 CLB)
  8. 5 FF: register file Rd address latch
  9. 1 LUT6 + 1 FF: register file write decoder and latch
  10. ? LUT: operand mux decoder (DC stage)
  11. ? LUT + ? FF: execute stage decoder/latch
  12. ? LUT + 2 FF: pipeline stall logic
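
As a quick cross-check, the entries of the list that do have numbers can be summed and compared with the 35 CLB (280 LUT6) RPM region. This is just a tally of the figures above; the items marked "?" are left out, so the real total is higher.

```python
import math

# (name, LUT6, FF) for the list entries with known counts; "?" entries omitted.
KNOWN_ITEMS = [
    ("register file", 40, 0),
    ("ALU", 32, 0),
    ("compare unit", 32, 0),
    ("immediate operand generator", 19, 0),
    ("next_pc + if_pc", 5, 10),
    ("pc_incr + dc_pc", 10, 10),
    ("result mux", 32, 0),
    ("Rd address latch", 0, 5),
    ("RF write decoder and latch", 1, 1),
    ("pipeline stall logic", 0, 2),
]

RPM_LUT6 = 35 * 8  # 35 CLB with 8 LUT6 each

lut6_total = sum(lut for _, lut, _ in KNOWN_ITEMS)
ff_total = sum(ff for _, _, ff in KNOWN_ITEMS)

print(lut6_total, ff_total)          # 171 28
print(math.ceil(lut6_total / 8))     # 22 CLB if packed perfectly
print(RPM_LUT6 - lut6_total)         # 109 LUT6 left for the unknown items
```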
Commentary based on public info and images from Gray Research:
  1. Instruction latch for DC stage implemented as BRAM primitives register
  2. There is absolutely no reasonable explanation for two 24 bit long carry chains!
  3. If PC unit really uses carry chain then the visual RPM map is incorrect
  4. If compare unit uses carry chain then the visual RPM map is incorrect 
  5. Resources for immediate operand selector can not be located, outside visible window?
  6. Store operations may take more than 1 cycle if arbitration is lost
  7. Load operations may take more than 2 cycles if arbitration is lost
  8. ALU does not use DSP (there are some 3rd parties saying it does)
Open question: why is there a datapath from the result multiplexer back to the ALU? The only use of this would be if shift operations are implemented as loops.






Thursday, December 13, 2018

Xilinx BRAM generator, some tricks

One may think that FPGA vendors know how to generate RAM from their primitives efficiently. I did, at least, until I tried it out for a small RISC-V implementation.

There are two memory blocks made with Vivado IP Integrator. For some reason this small RISC-V SoC showed a resource utilization of over 200 LUT, while I know it should be smaller than that. The detailed report showed 24 LUT and 3 flip-flops consumed in the 32 KByte, 8-bit-wide RAM. How can this be? A 32Kx8 bit memory should use 8 BRAM primitives and 0 LUT. Checking the RTL view after synthesis:
OK, this explains part of the problem: the BRAMs are configured as 8 bit wide, with an 8 to 1 multiplexer at the output. This generates some LUT, but where did those 3 flip-flops come from? Looking again at the post-implementation RTL:


Right, they are needed: the address must be delayed by one clock for the output multiplexer to work properly, so those 3 flip-flops are really required.
So while the complete RISC-V soft CPU takes 59 slices, the "extra added overhead" from the Xilinx RAM generator takes 11 slices! Checking the configuration options:
So where is 32kx1? This would be the one to choose when making a 32K deep memory, but this option is simply missing. Let's try what happens if we select 16kx1 - OK, this is looking better. This time the RAM synthesizer is using 32kx1 even though 16kx1 was selected, so the generator must have read my mind and done what I wanted.
Both memory blocks now use only BRAM and no logic resources.
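
Why the primitive aspect ratio matters can be shown with a little arithmetic. The sketch below assumes the usual 7-series RAMB36 aspect ratios (32Kx1 down to 1Kx36); the post does not name the device, so treat the table and the function name as assumptions. It counts how many primitives are stacked in depth, which determines the output multiplexer and the registered select bits.

```python
# Assumed RAMB36 aspect ratios as (depth, width); not taken from the post.
RAMB36_ASPECTS = [(32768, 1), (16384, 2), (8192, 4), (4096, 9), (2048, 18), (1024, 36)]

def plan_memory(depth, width, prim_depth, prim_width):
    """Return primitive count, output mux size and select flip-flops."""
    wide = -(-width // prim_width)      # primitives side by side (ceil division)
    deep = -(-depth // prim_depth)      # primitives stacked in depth
    select_bits = (deep - 1).bit_length()  # address bits delayed for the mux
    return {"primitives": wide * deep, "mux_inputs": deep, "select_ffs": select_bits}

# 32K x 8 built from 4Kx9 primitives: 8 blocks, 8:1 output mux, 3 flip-flops.
print(plan_memory(32768, 8, 4096, 9))
# 32K x 8 built from 32Kx1 primitives: 8 blocks, no mux, no flip-flops.
print(plan_memory(32768, 8, 32768, 1))
```

Both organizations use 8 block RAMs; only the deep-and-narrow one avoids the multiplexer LUTs and the 3 delayed address bits seen in the report.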


Looks nice and works too!


Monday, October 15, 2018

RISCV SoftCPU Contest Part III

RISCV - Getting started with Zephyr

To my big surprise, Zephyr provides build instructions for Windows users; this is really unexpected.

After lots of tweaking and a long evening, by the next morning I am getting closer to compiling something with Zephyr targeting the RISC-V architecture.

Blinky example (blink an LED with GPIO), errors for the different target boards selected:

BOARD=m2gl025_miv - error gpio drivers are missing
BOARD=zedboard_pulpino - /core/isr.S:447 Error: unrecognized opcode 'eret'
BOARD=hifive1 - /core/fatal.c:198 undefined reference to 'cause_str'
BOARD=qemu_riscv32 - error gpio drivers are missing

So it seems that the LED Blinky example does not work out of the box.

Hello World Example

BOARD=m2gl025_miv - OK, ROM: 10 Kbyte, RAM 4 Kbyte
BOARD=zedboard_pulpino - /core/isr.S:447 Error: unrecognized opcode 'eret'
BOARD=hifive1 - OK, ROM: 14 Kbyte, RAM 4 Kbyte
BOARD=qemu_riscv32 - OK, ROM: 1 Kbyte, RAM 13 Kbyte

Philosophers

BOARD=m2gl025_miv - OK, ROM: 18 Kbyte, RAM 9 Kbyte
BOARD=qemu_riscv32 - OK, ROM: 1 Kbyte, RAM 27 Kbyte

Synchronization 

BOARD=m2gl025_miv - OK, ROM: 11 Kbyte, RAM 6 Kbyte
BOARD=qemu_riscv32 - OK, ROM: 1 Kbyte, RAM 16 Kbyte

What is interesting is that the ROM code for QEMU is always the same size, 1052 bytes, while for the real targets the ROM size differs. The reason is the linker files: the instructions are not always placed in ROM sections, so we cannot use the default memory map statistics to see how large the instruction memory is.

What we do see is that 32K shared RAM space for instructions and data should be sufficient for all the RTOS examples that are needed to pass the requirements set by the contest rules.
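
The 32K claim can be checked directly against the numbers listed above. The snippet simply adds ROM and RAM (in Kbyte, as reported by the builds) for every example that compiled and looks for the worst case; the dictionary layout is only for illustration.

```python
# (example, board) -> (ROM Kbyte, RAM Kbyte), copied from the results above.
FOOTPRINTS = {
    ("hello_world", "m2gl025_miv"): (10, 4),
    ("hello_world", "hifive1"): (14, 4),
    ("hello_world", "qemu_riscv32"): (1, 13),
    ("philosophers", "m2gl025_miv"): (18, 9),
    ("philosophers", "qemu_riscv32"): (1, 27),
    ("synchronization", "m2gl025_miv"): (11, 6),
    ("synchronization", "qemu_riscv32"): (1, 16),
}

SHARED_RAM_KBYTE = 32

totals = {key: rom + ram for key, (rom, ram) in FOOTPRINTS.items()}
worst_key = max(totals, key=totals.get)

print(worst_key, totals[worst_key])                       # largest combined footprint
print(all(t <= SHARED_RAM_KBYTE for t in totals.values()))  # True
```

The largest combined footprint is 28 Kbyte (Philosophers on qemu_riscv32), so everything fits into 32K with a little room to spare.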

Setting up compliance testing

Currently the mainstream RISC-V compliance suite only provides "environments" for instruction set simulators. The forked repo https://github.com/micro-FPGA/riscv-compliance provides some more environments, most notably the "absmin" environment.
 
What is the best way to test the compliance of the SoftCPU? Well, the best way is to design a new instruction set simulator that simulates our SoftCPU, and to validate that simulator first.
And here is a screenshot of this simulator passing the compliance test suite for the LW instruction (compiled with the absmin environment).

Friday, October 12, 2018

RISC-V SoftCPU Contest Part II

Let's look at the contest requirements in closer detail.

The CPU must pass RV32I compliance tests

As a reference, the following GitHub URL is given:

https://github.com/riscv/riscv-compliance/tree/master/riscv-test-suite/rv32i

There is a Makefile and two subdirectories: src and references. The src directory includes the source code for the tests and the references directory the reference dumps for the tests. We have no choice but to assume that we need to pass all the tests, using all the test cases from the src directory. If we now read the riscv compliance suite documentation a bit higher up in the same GitHub repository, we see that it is required to use the reference signatures:

The only requirement needed in this case is that there must be an option to dump the results from the target in the test environment so as the comparison to test reference signature is possible.

So no matter how we validate our SoftCPU, we must provide a way to dump the signature data from memory after the test runs. We should use ALL the sources from the compliance directory, compare the results with the reference signatures, and get a match in all cases. This is the only way to pass the compliance tests.

The good thing is that there is documentation on how to set up the "target" environment for the RISC-V compliance tests: https://github.com/riscv/riscv-compliance/tree/master/doc

Documentation is always good to have, right? So what do we have in this document? Let's search for the word "verilog" - found, there is a topic about the use of Verilator. Very good, so we follow the given steps to set up Verilator as the "target" and we are halfway done setting up the compliance suite environment, right?

No. The section for Verilator has one word as its content: "tbd", nothing more! Similarly, there is a section for one existing hardware target with the same word "tbd" as its full content.

So there is no documentation on how to set up the RISC-V compliance test environment for any RTL simulator or any real hardware target.

Cool, eh? So as part of the contest entry we must implement this without guiding documentation and with no example references. The only "targets" included in official environment are pure instruction set simulators.

What options do we have?

First, we could use Verilator. In that case we should add support for dumping the signature from memory. This should be doable with some custom C coding: the code would need to figure out where the "dump" region is placed and how large it is, and then write it out in a format compatible with that of the compliance test reference dumps. Or we write it out in a different format and use some post-processing scripts to convert the dumps to the correct format.
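
A post-processing script of that kind could look roughly like the sketch below. It is only an illustration: it assumes the simulator leaves us a raw little-endian memory image, that the start and end addresses of the signature region are known (for example from linker symbols), and that a reference dump holds one 32-bit word per line as 8 hex digits. The real reference format should be checked against the files in the compliance repository; all names here are hypothetical.

```python
def signature_lines(memory, begin, end, base=0):
    """Format the little-endian 32-bit words in [begin, end) as hex text lines."""
    lines = []
    for addr in range(begin, end, 4):
        offset = addr - base                       # position inside the image
        word = int.from_bytes(memory[offset:offset + 4], "little")
        lines.append(f"{word:08x}")
    return lines

def matches_reference(lines, reference_text):
    """Compare against a reference dump, ignoring case and blank lines."""
    reference = [l.strip().lower() for l in reference_text.splitlines() if l.strip()]
    return lines == reference

# Tiny made-up memory image holding two signature words at 0x100..0x108.
image = bytes.fromhex("44332211" "efbeadde")
dump = signature_lines(image, 0x100, 0x108, base=0x100)
print(dump)                                           # ['11223344', 'deadbeef']
print(matches_reference(dump, "11223344\nDEADBEEF\n"))  # True
```

The same comparison works no matter whether the image comes from Verilator, another RTL simulator or a RAM dump logged from real hardware.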

We could use any other RTL simulator as well; in this case we would surely need some post-processing scripts. This would also be a valid option, as the rules do not require the compliance tests to be run with Verilator; they just have to be run by some means.

We could also validate on a real FPGA. In that case we would need to dump the RAM to the console after each test and log it for later comparison; this would also be a valid compliance test.

Which path we take depends on our experience, skills and mood, I guess. Using the real FPGA would be the most time consuming, as it would require the FPGA board to be loaded with all the test images and all the data to be logged over the serial port - not much fun. So the Verilator or RTL simulator based approaches are faster and require less manual work, as the tests would run in one batch. And I would not envy the judges if they had to use FPGA JTAG programming and UART console logging to verify all the contest entries.

Could we just implement the ASSERT IO macros and forget the signature dumps? Unfortunately no, the compliance documentation does not allow this method, at least not yet. So if we do not do the signature dumps, the contest judges may disqualify our entry as non-compliant.

(Must) be possible to be simulated using Verilator

This is a hard requirement by the rules. The rules, however, do not say that we have to simulate the CPU with Verilator or provide any scripts or testbenches for Verilator. As long as we use plain Verilog files we should be fine, right? But how would the judges verify our claim that Verilator simulation is possible? Would the judges create the required setup for Verilator and verify our SoftCPU in the time they have for it? The deadline is 23:59 on 26th November and the winners are announced on 3rd of December!

No. No way the judges would have time to do that, would they? Actually, they have to verify the claims for the competition to be fair. Otherwise the winning entry might not simulate with Verilator at all, if the judges did not test for it and the entry contained no documentation or proof either. The wrong entry would then win, and it could be you who loses.

I would say to be safe you should provide some Verilator testbench/script and documentation about it as part of your entry.

But "test coverage" is another thing; it is not mentioned in the rules at all. So it would be perfectly legal to wire the instruction memory read bus to 0x00000063 (beq x0, x0, 0, a branch to itself) and start Verilator with the CPU core. It would be a simulation, fairly minimal but still a valid one.
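
For a constant instruction word to keep the core spinning, it has to decode as a branch to itself. Written as a 32-bit word, that RV32I instruction is 0x00000063 (bytes 63 00 00 00 in little-endian memory order), because the opcode sits in bits 6..0. The small decoder below is only a sketch to check this; the function name is made up.

```python
def decode_branch(word):
    """Decode an RV32I B-type word; return None if the opcode is not BRANCH."""
    if word & 0x7F != 0x63:          # BRANCH major opcode lives in bits 6..0
        return None
    funct3 = (word >> 12) & 0x7
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    # Reassemble the scattered 13-bit offset: imm[12|10:5] and imm[4:1|11].
    imm = ((((word >> 31) & 0x1) << 12) | (((word >> 7) & 0x1) << 11)
           | (((word >> 25) & 0x3F) << 5) | (((word >> 8) & 0xF) << 1))
    if imm & 0x1000:                 # sign-extend the offset
        imm -= 0x2000
    return {"funct3": funct3, "rs1": rs1, "rs2": rs2, "offset": imm}

print(decode_branch(0x00000063))  # beq x0, x0, 0: branches to itself forever
print(decode_branch(0x63000000))  # None: the low 7 bits are 0, not a branch
```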

There is however another problem - to win the contest we are pretty much forced to use FPGA architecture hard IP primitives directly, using vendor libraries or the vendor IP core generator. For those hard IP blocks we do not have Verilog simulation code. So in order to make the SoftCPU Verilator friendly we need to provide pure Verilog simulation code to be used in place of those hard IP blocks.

Which brings us to the next problem - if we validate in simulation with Verilator or an RTL simulator, we are forced to use a Verilog-only version of our SoftCPU (one that replaces the hard macro IP blocks with Verilog). If that simulation-only code behaves differently from the real hard IP blocks, then our validation is invalid: the CPU would pass compliance in simulation but not in the FPGA.

So to be really, really safe we should run the compliance tests on a real FPGA, because Verilator would not use the same code base as the FPGA technology optimized code of the SoC implementation.

You say we SHOULD write in FPGA vendor neutral Verilog? This is just not possible. One example would be the Microsemi targets - there we would surely need to use eNVM and/or eSRAM for ROM/RAM storage, but those resources are only accessible via Libero SmartDesign and exposed as a black box with an AHB-Lite interface, with no BSD licensed simulation Verilog available.

Sure, if we provide an AHB-Lite RAM model for Verilator and use the SmartDesign based eSRAM hard IP block in the FPGA design, the judges would not disqualify us for providing compliance testing in simulation only.

But if we use, say, the Math/DSP blocks in an "enhanced" way in Microsemi and/or Lattice iCE40 UltraPlus designs, the issue is way more complex. What if the Verilog model we provide to mimic the FPGA vendor's hard IP block is not correct?

Conclusion: to be safe we should run all compliance tests in simulation (Verilator preferred) and also in the real hardware (at least if we use vendor IP blocks directly).

Dhrystone

From the rules: performance will be measured with the Dhrystone benchmark (from the riscv GitHub!) compiled with the -O3 -fno-inline options. We should assume that we must run those source files from the referenced GitHub location without modifications, right?

The main C file dhrystone_main.c prints out following as result:

printf("Microseconds for one run through Dhrystone: %ld\n", Microseconds);
printf("Dhrystones per Second: %ld\n", Dhrystones_Per_Second);

So we get a single metric as the result - Dhrystones per Second - and we must assume that this is the only result used in the performance scoring. It is not Dhrystones/MHz, it is Dhrystones per Second. This means that the contest is not for a SoftCPU but for an FPGA SoC implementation, as the Dhrystone result is highly dependent on the memory subsystem performance in the FPGA and on the maximum reachable CPU clock frequency.
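
The difference between the two metrics is easy to see with made-up numbers. In the sketch below both designs, their clock frequencies and their cycle counts are purely hypothetical; the point is only that the score is clock frequency divided by cycles per Dhrystone run.

```python
def dhrystones_per_second(clock_hz, cycles_per_run):
    """Absolute score: how many Dhrystone runs complete in one second."""
    return clock_hz / cycles_per_run

# Hypothetical design A: efficient per clock, but limited to 100 MHz.
score_a = dhrystones_per_second(100_000_000, 500)
# Hypothetical design B: 20% more cycles per run, but closes timing at 150 MHz.
score_b = dhrystones_per_second(150_000_000, 600)

print(score_a)  # 200000.0
print(score_b)  # 250000.0
```

Design B loses on Dhrystones/MHz but wins the contest metric, which is why clock frequency and memory latency matter as much as the core itself.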

So what we need to optimize for speed are:
  1. SoftCPU performance tuned for Dhrystone benchmark only
  2. Memory subsystem performance
  3. Bus structure performance
  4. FPGA timings tuning to reach higher clock for our SoC design
What about overclocking? How much overclocking is allowed? We could even say that we need to increase the IGLOO2 core voltage to 1.25V and cool it to -40C; this would improve the FPGA timings a lot. OK, let's forget overclocking (but it would not actually violate the rules).

A good read on Dhrystone is the EEMBC whitepaper about it. Well, most of the ways to fake the results cannot be used in this contest. Wait, the Zephyr GCC is required for the Zephyr RTOS, but the rules do not actually say which compiler should be used for the Dhrystone test? So if we get very technical we could use an "optimized" compiler to improve our Dhrystone result. Well, I guess we would get disqualified, but by the rules it would be valid.

What about the -DREG option? The rules say nothing about it. There are actually many more parts of the benchmark build that are not clear. In the official riscv GitHub repository the command line for the performance tests is
-DPREALLOCATE=1 -mcmodel=medany -static -std=gnu99 -O2 -ffast-math -fno-common -fno-builtin-printf
The contest requires -O3, so we must assume that the Makefile from the official riscv benchmark repository should not be used. What about the other files from the official repository? The Dhrystone C file that we assume we MUST use includes "util.h". This file is located in
riscv-tests/benchmarks/common
Are we required to use this util.h file? Or can we modify the benchmark source C code? This include file also pulls in encode.h and declares:

extern void setStats(int enable);

This function is defined in /benchmarks/common/syscalls.c - if we look at that file it is clearly made to be used only in instruction set simulators not in RTL simulation or in FPGA benchmarking.

So what can we do? Should we modify the benchmark source code? Better not, but then we would need to provide some other "util.h" replacing the one from riscv github with our own.

Time - we need a real-time timer as well, or the benchmarking would not make sense. But hey, we could accidentally have the timer run at, say, a 5% wrong clock? It could improve our score by 5%?
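
How much a wrong timer clock changes the reported number follows from simple proportions. The sketch below assumes the benchmark divides a fixed amount of work by the elapsed time it reads from the timer; the function name and the example score are made up.

```python
def reported_score(true_dhrystones_per_second, timer_rate):
    """timer_rate = actual timer frequency / frequency the software assumes.

    A timer that runs slow (rate < 1) under-reports elapsed time,
    so the computed Dhrystones per Second goes up.
    """
    return true_dhrystones_per_second / timer_rate

honest = reported_score(1000.0, 1.00)
slow_timer = reported_score(1000.0, 0.95)   # timer runs 5% slow

print(honest)                 # 1000.0
print(round(slow_timer, 2))   # 1052.63, about 5.3% better than reality
```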

What about the "Smallest" category: do we have to provide Dhrystone capability or not? Dhrystone requires some sort of real-time clock; for the smallest category we could omit that, as Dhrystone would not be used to score it. But maybe it is still required to be able to run the Dhrystone tests on an entry that only targets the "smallest" category? Not clear. To be safe we should make sure we can run the Dhrystone test even if we clearly target only the "smallest" category.

GETTING MAD, 32 bits at a time..

To understand the story with Dhrystone for riscv in the context of the RISC-V SoftCPU contest I tried it out. Here it goes. Toolchain? Let's take the official one, and a pre-compiled one to be sure that it is correctly configured and compiled, so from here:

https://gnu-mcu-eclipse.github.io/toolchain/riscv/

Here it clearly says that this page provides the correct multilib toolchain for embedded (non-Linux) targets. Absolutely everything says this must be the correct toolchain to use.

The contest targets RV32I, so I set up a build script for Dhrystone using -march=rv32i/-mabi=ilp32. I am using unmodified files from the riscv GitHub, and this is how far I get:

undefined reference to __umoddi3
undefined reference to __mulsi3
undefined reference to __divsi3

The errors come from syscalls.c, from dhrystone.c and from dhrystone_main.c - from all the included C files!

A quick Google search says that these errors happen when targeting 32-bit RISC-V with a toolchain that is incorrectly configured - the multilib option not enabled. What? The very web page where I got the toolchain says it is a "multilib" toolchain? Does it mean multilib in some other context?

What about -march=rv32im ?

Wow - the errors from Dhrystone and Dhrystone_main disappeared; only a few errors from syscalls.c remained!

Now this is important - this simple test clearly shows that if we are competing in the highest performance category we must implement RV32IM. This is not an option, it is pretty much a requirement (well, on the Microsemi platform at least).
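
The unresolved `__mulsi3` and `__divsi3` symbols are the software helpers the compiler calls when the target has no hardware multiply or divide. To get a feeling for what that costs, here is a sketch of the shift-and-add loop such a 32-bit multiply helper typically performs. It is an illustration, not the libgcc source, and the function name is made up.

```python
MASK32 = 0xFFFFFFFF

def soft_mul32(a, b):
    """Shift-and-add multiply modulo 2**32; returns (product, loop iterations)."""
    a &= MASK32
    b &= MASK32
    result = 0
    steps = 0
    while b:
        if b & 1:                       # add the shifted multiplicand
            result = (result + a) & MASK32
        a = (a << 1) & MASK32
        b >>= 1
        steps += 1
    return result, steps

print(soft_mul32(7, 6))                 # (42, 3)
print(soft_mul32(3, 0x80000000)[1])     # 32 iterations in the worst case
```

Every iteration is several RV32I instructions, so a single multiply can cost on the order of a hundred cycles in software, against a few cycles with the M extension.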

Dhrystone uses the strcmp function once, and it is implemented in syscalls. If we manage to optimize it even a little we gain some benefit, or we could just return the correct result without performing the function - this would be faking, of course. But if we do not optimize strcmp, maybe our competitor does and uses that performance boost to win? The status of the syscalls file is not really clear. I would assume that it is OK to modify it. Take these:

extern volatile uint64_t tohost;
extern volatile uint64_t fromhost;

They are in syscalls to TALK to the instruction set simulator. We are, however, not running the tests in an instruction set simulator, so we pretty much should modify syscalls to match our embedded FPGA SoC? I assume we can do so without violating the contest rules. OTOH it may also be possible to modify the SoftCPU FPGA SoC in such a way that the syscalls from the riscv GitHub could be used without modifications?

Are you confused? I truly am.

to be continued...

Wednesday, October 10, 2018

RISC-V SoftCPU Contest

This contest was initially launched at ORCONF 2018 in Gdansk and is officially now hosted at riscv.org - RISC-V SoftCPU Contest.

Big question, what is evaluated in the contest: a SoftCPU core or SoC based on the SoftCPU core?
From the rules:

The entries will be RV32I-compliant soft CPU's.

But it is also clear that the SoftCPU core itself can not pass the minimum requirements as it has to be a complete FPGA implementation and it must run Philosophers and Synchronization examples of Zephyr RTOS. So to pass the requirements we must have a minimal SoC that uses our SoftCPU while for the resource utilization we should only count the resources used by the SoftCPU right? The SoC system bus and peripherals components are for sure not part of the CPU.

Requirements:

  1. RV32I-Compliant - exact documents not specified!
  2. For performance category Dhrystone is used
  3. Must run Zephyr 1.13 version keeping RTOS core (not specified) untouched
  4. Must pass (assumed ALL) RV32I compliance tests
  5. Must boot Philosophers and Synchronization (but they may fail?)
  6. Complete FPGA design for IGLOO2, SmartFusion2 or iCE40 UltraPlus
  7. Must use unmodified GCC toolchain provided by Zephyr
  8. Must use verilog and must be possible to simulate with verilator
  9. Must include binary version of the bitstream and instruction to build it
Let's look at the requirements. From the RISC-V website we can get two ISMs (Instruction Set Manuals): one for the user level ISA, version 2.2, and one for the privileged ISA, version 1.10. We must assume that compliance for RV32I can be figured out by studying those documents.

There is no way those documents describe the RV32I requirements cleanly. First, there are separate user and privileged ISM documents. We absolutely have to follow the user ISM, that much is sure. But what about the privileged ISM document? If we want to (and we have to) pass the "compliance testing" in full, we must implement part of the privileged stuff; there is no way around it. But it is not defined which parts - so what can we do?

We look at the "compliance test" suite, list all the features that are exercised there, and implement them using both ISM documents.

Some examples: the privileged ISM describes the WFI instruction, which is not present in the user ISM. As this instruction is not tested by the compliance suite, we do not need to implement it. But MRET, from the privileged ISM, we do need to implement, as it is used in the compliance suite.

It is an equally bad story with the CSRs - the user ISM describes three 64 bit CSRs as part of RV32I, but the compliance suite uses many more, and many more still are described in the privileged ISM. Which ones do we need to implement to be RV32I compliant?

There is no definitive answer to those issues.

Let's take the mtval CSR. It is described in the privileged ISM and is also partially tested in the compliance suite, but referenced there by its old name mbadaddr from an outdated ISM document. mtval should contain the instruction CODE on an illegal instruction trap, but as this is not checked we do not need to implement it, right? We need to implement mtval only to the extent the compliance suite tests it.

Another issue - what about register widths? On a given system some registers would always contain zeros in their leading bits; can we hardwire those bits to 0? As an example, we can limit the external memory region to only 1 Mbyte - on a small SoC in a small FPGA all the available memory and peripherals would fit into that memory space. In this case the program counter would not need the full 32 bits? The same goes for some other registers. As those bits would always be 0, could they be implemented as constant 0? Or would that be considered a violation of the ISM document?
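
How many program counter bits would actually be left can be worked out quickly. The sketch assumes a power-of-two address region and 4-byte aligned instructions (plain RV32I without compressed instructions), so the two lowest PC bits are constant zero as well; the function names are only for illustration.

```python
def address_bits(region_bytes):
    """Number of address bits needed to reach every byte of the region."""
    return (region_bytes - 1).bit_length()

def pc_flops(region_bytes, alignment_bytes=4):
    """PC flip-flops needed when upper and alignment bits are hardwired to 0."""
    return address_bits(region_bytes) - address_bits(alignment_bytes)

print(address_bits(1 << 20))   # 20 address bits for 1 Mbyte
print(pc_flops(1 << 20))       # 18 flip-flops for the PC
print(pc_flops(1 << 32))       # 30 flip-flops for a full 32-bit address space
```

So the 1 Mbyte limit would save 12 flip-flops per PC-like register, plus the matching adder and multiplexer bits.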

Very very complicated - the requirements "RV32I" are not clear at all.

The user ISM document says that three 64 bit CSRs (the counters) are mandatory; however, for low end implementations the ISM allows the upper 32 bits to be implemented in software! But no matter whether 32 or 64 bits, we need 3 read-only counters. Counters can be implemented using DSP blocks, which would save plenty of FPGA logic elements. So if we want to minimize logic use, we should use the FPGA architecture's hard DSP IP blocks for those counters. Because if we do not do this, our competitor will. It also seems that we can combine the RDCYCLE and RDTIME counters into one counter; there is nothing in any of the documents that says it is not valid to do so.

OK, let's try to figure out the minimum set of CSRs needed to satisfy the contest rules.

First - all unimplemented CSRs must be readable and return 0 (like misa and many more).
mtvec - can be hardwired and read only; for optimization a vector with the fewest 1's would be best
mscratch - needed, used in the compliance tests
mtval - needed, used in the compliance tests (but only partial functions are tested..)
mepc - needed (lower 2 bits should be 0)
mcause - needed, but the number of exception codes supported is not clear
mip/mie - not clear if needed, as interrupt handling is not mandatory
mstatus - not clear if needed; the Zephyr core does save/restore it, but does not use it directly
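
The list above can be written down as a small behavioural model. This is a sketch of one possible minimal CSR file, not a statement of what the compliance suite requires: the CSR addresses are the standard machine-mode ones from the privileged ISM, while the fixed mtvec value and the class name are made up, and the CSRs marked "not clear" are simply treated as unimplemented.

```python
MTVEC_FIXED = 0x00000100   # hypothetical hardwired trap vector with a single 1 bit

class MinimalCsrFile:
    MTVEC, MSCRATCH, MEPC, MCAUSE, MTVAL = 0x305, 0x340, 0x341, 0x342, 0x343

    def __init__(self):
        # Only these four CSRs get real storage.
        self.regs = {self.MSCRATCH: 0, self.MEPC: 0, self.MCAUSE: 0, self.MTVAL: 0}

    def read(self, addr):
        if addr == self.MTVEC:
            return MTVEC_FIXED           # hardwired, read only
        return self.regs.get(addr, 0)    # unimplemented CSRs read as 0

    def write(self, addr, value):
        if addr not in self.regs:
            return                       # writes to anything else are ignored
        value &= 0xFFFFFFFF
        if addr == self.MEPC:
            value &= 0xFFFFFFFC          # lower 2 bits of mepc are always 0
        self.regs[addr] = value

csr = MinimalCsrFile()
csr.write(MinimalCsrFile.MEPC, 0x1237)
csr.write(MinimalCsrFile.MTVEC, 0xDEADBEEF)   # ignored, mtvec is fixed
print(hex(csr.read(MinimalCsrFile.MEPC)))     # 0x1234
print(hex(csr.read(MinimalCsrFile.MTVEC)))    # 0x100
print(csr.read(0x301))                        # misa reads as 0
```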

For a minimal system supporting the simplest RTOS we need at least a "tick" interrupt. This, however, could be implemented as an external "tick" triggering an NMI, with special logic to enable it (disabled on power-on reset) - this would allow us to exclude all interrupt related logic from the RV32I core.

OK, this is now really complicated. From the rules: the Zephyr "core" should not be modified; it is, however, not specified what is considered part of the Zephyr core.

Maybe it refers to the riscv32 core arch?
https://github.com/zephyrproject-rtos/zephyr/tree/master/arch/riscv32/core
Or does it refer to some undefined set of files from undefined selected directories?

In any case it is not clear what the minimal required support for the systick/timer is for Zephyr (in such a way that the "core" is not modified).

In short: the requirements for the RV32I are not clear.

Now let's look at the grading for the smallest implementation: FPGA resource utilization. How is this calculated?

There is no weighting of DSP and RAM vs LE (logic elements). Smallest number of total resources!? This can only be read as: every resource counts as 1 towards the total. So each DSP and RAM instance has the same weight as a logic element. Now, logic elements include a LUT and a flip-flop - sometimes only the logic is used, sometimes only the flip-flop, and sometimes both. The scoring does not account for that, so if more of our logic elements use both the LUT and the flip-flop, we win..
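
Read that way, the score is just an unweighted sum. The sketch below uses two hypothetical designs (the numbers are invented) to show why such a rule rewards moving logic into DSP and RAM blocks.

```python
def total_resources(logic_elements, dsp_blocks, ram_blocks):
    """Unweighted score as read from the rules: every instance counts as 1."""
    return logic_elements + dsp_blocks + ram_blocks

# Hypothetical design with adders and counters built from logic elements.
all_in_fabric = total_resources(logic_elements=300, dsp_blocks=0, ram_blocks=2)
# Same hypothetical design with adders and counters moved into two DSP blocks.
using_dsp = total_resources(logic_elements=200, dsp_blocks=2, ram_blocks=2)

print(all_in_fabric)  # 302
print(using_dsp)      # 204
```

A whole DSP block costs the same single point as one logic element, so anything that fits into a hard block is almost free.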

Based on the above to win in the smallest category, we should:
  1. Implement all adders and counters using math blocks.
  2. Implement register file using minimal number of RAM blocks
  3. Push as much as possible from the SoftCPU to the SoC subsystem
And now let's make it more complicated. The rules say that on SmartFusion2 we should not use the hard CPU subsystem as part of the SoftCPU, but hey, there is 256K of eNVM in IGLOO2; this resource is not part of a hard CPU and it is not a logic element, RAM or math block. So if we use that resource as ROM lookup or microcode storage it would not count towards the resource utilization. Interesting, eh? Also interesting is the note that "interesting ways" to enhance the design using the hard CPU subsystem can be implemented - do we get special points for this too? No idea, possibly not.

Hmm.. "Hard CPU subsystem should not be used such" - but what about eSRAM? This is where things get really complicated. Let's say we implement the SoftCPU on IGLOO2 and use the eSRAM as the register file. This is, well, kinda stupid, but it would not violate the rules! On IGLOO2 eSRAM is counted as part of the FPGA and, if used, would increase the utilization of the SoftCPU. But let's move the same design to SmartFusion2: now the same eSRAM resource is part of the MSS and not counted as an FPGA RAM resource. So what now - is our design against the rules, or should we "not count" the eSRAM as it is not part of the FPGA fabric? Complicated.

Again, short: the way of calculating the "resources" is not clear at all.

TIP 1: expose as low level and simple an interface to the external world as possible. Push all the "bus adapter" and bus multiplexer code out of the SoftCPU core, and make this external bus as narrow as possible, say only 1 Mbyte total; this saves some logic too.

Target Devices with resources:

IGLOO2 M2GL025
LE (4LUT+FF): 27696
DSP 18x18: 34
eNVM: 256KB
Total RAM: 1104 Kbits includes eSRAM

SmartFusion2 M2S025
LE (4LUT+FF): 27696
DSP 18x18: 34
eNVM: 256KB - part of MSS
Total RAM (Fabric): 592 Kbits excludes eSRAM ?

ICE40 UP5K
LE (LUT4): 5280
Total RAM: 1024Kbit (no init!) + 120 Kbits
DSP 16x16: 8

Let's look at the links provided for evaluation boards:

iCE40 UltraPlus Breakout Board - this board has an FT2232 for programming but not for UART (channel B has no connections), so we need an extra TTL USB UART adapter, or else emulate the Philosophers messages with Morse code on the LEDs. Not cool. The board has SPI flash that we can use to store the software image to be loaded into the SPRAM, which cannot be initialized from the boot image.

iCE40 UltraPlus MDP provides both Programming and UART connection and SPI Flash.

Upduino V2 includes USB Programming and SPI Flash

iCEvision includes USB bootloader? I SELL ON tindie ? Number of orders since April is 3!? No documentation? No thanks :) not for me.

Well, no matter which iCE board we use, we need to implement an SPI bootloader (or UART bootloader) to boot the Zephyr code.

On the Microsemi boards we can use the eNVM for the zephyr boot code.

How to win the smallest implementation contest

If the grading is really done by the total resources per SoftCPU and performance is not evaluated at all, then we could implement a single bit serialized core. I have done a partial bit-serial implementation of the ARM Cortex M0 - it surely reduces resources, and RISC-V could be even simpler in bit-serial mode than the ARM Cortex. But when we look at the CSRs that also need to be implemented, the resulting bit-serial RISC-V may not be the smallest implementation. So what options do we have left? Sure - a microcode implementation!

Here we have two options: we can use some existing soft core (maybe tweaked and tuned), or we can create a new FPGA architecture optimized soft core designed to execute microcode that emulates RV32I.

How small can it be? Short answer: it could be damn small. It would be stack based and use only one block RAM resource and one accumulator (top of stack register).

There is one project that I made a long time ago that has a soft core that could surely emulate RV32I well enough to satisfy the rules of this contest. It was implemented in a Microsemi (then Actel) ProASIC3 device, the A3P060, with the following resources:

Equivalent LE: 700
DSP: none
RAM Bits: 4608 bit (4 block x 512 Byte)
FlashROM bits: 1024

With A3P060 I implemented a "specialized ASSP" with following features:
* AVR like softCPU optimized ISA
* FlashROM was used to bootstrap initial code from SPI flash
* Code space was "banked" and I used loadable overlays from SPI flash
* There was some logic for streaming reads from SD card in 4 bit mode
* And there was some other logic (some folks know what...)
* It was programmed using AVR Basic Compiler (product of Silicon Studio aka Antti Lukats)

This SoC system did fit into A3P060 !

I am 100% sure that I could implement RV32I on that SoC (with very minor modifications), even if I needed to load a 512 word AVR code overlay for each RV32I instruction; performance does not matter at all. So it would still be a valid RV32I SoftCPU under the rules.

But even that optimized AVR is too large; the microcode engine that would eventually win the smallest implementation could be much smaller. But then you would need to write a compiler for a stack based CPU, something I have not yet managed.

So let me make a forecast - the smallest implementation of an RV32I SoftCPU that satisfies the rules could, on both architectures, be as small as:

MicroCode engine for RV32I:
LE: 200 ? Maybe less, depends on the time you spend optimizing the architecture
DSP: 2 ? One for PC increment and one for ADD function
Block RAM: 1 used for Stack and everything else

Both the RISC-V code and the emulation engine microcode would come from external SPI flash, so the microcode storage would not count towards the resources. It would be damn slow, but that is OK for the smallest implementation. It would be smaller than a bit-serial implementation without microcode.

So why am I not doing it with 200 LE if I claim it could be done? Well, that is not a challenge for me; I know I could do it, so the only motivation would be the prize money. And that is just too low. And if I submitted a 200 LE version, you could still beat me with a 198 LE implementation, or maybe you do it with 156 or less? Nonsense. It makes no sense to optimize that low. The rules just do not make sense.

Sorry.

There is another thing - a 200 LE version (or maximum resource optimized with the goal to win) of RV32I that satisfies all the rules set for this contest is not meaningful to be used in any real life design. It would be an effort to get the prize money, not to create something that is useful and re-usable.

So let me try to set up some rules that would make sense for the smallest implementation.

NO-NONSENSE RISCV Contest Rules version 1:
  1. RV32I or RV32E
  2. Must run hello world compiled with "official" GNU RISCV GCC
  3. Minimal SoC should fit Lattice XO2-1200 executing code from SPI flash (XiP)
A SoftCPU that satisfies the above rules would be usable in real life projects. 
Hint: AVR SoC inside XO2-1200 leaves plenty of resources free for application specific peripherals. 

Bit-serial RV32E would be cool. In less than 1000 LUT ? Really Cool.

Now I should shake some prize money ? ;)

Well, bitserial RV32E would really make sense no matter if it fits 1000 LUT or runs on XO2-1200.

Tuesday, October 2, 2018

RISC-V Take I

RISC-V explained in two words: "Frozen ISA" - indeed this is all there is: a RISC instruction set that is fixed and guarded by a foundation. Products designed to the fixed RISC-V ISA should remain compatible and working as long as they adhere to the ISA; that is the whole point, nothing more.

There is however no known good implementation of RISC-V, there are many implementations but none of them is the golden reference.

picorv32 is a nice and simple implementation written in Verilog, with a simple adapter to provide AXI bus support. It took maybe an hour to convert it to a Vivado IP Catalog IP core; this is how it looks in IP Integrator:

Now it takes just a few mouse clicks to create some simplest RISC-V SoC system:


For initial testing I just added AXI BRAM and then used Vivado automation to connect the bus infra for me.

Next is testing, right? Now we need a C compiler, or at least an assembler, for RISC-V; this should be simple, right? Well, no - it is not that easy to find Windows executables for RISC-V. Well, there is the F32C project that does include those binaries; the binaries are actually hosted on the FPGArduino web site.

I have implemented F32C in FPGA before, I have used the IDE and compiled C code for f32c using their compiler. So it must work. It must produce valid and working code? Right? The compiled programs do work, I know they do.

Next I create the smallest test program (in assembler) to compile a few instructions for the picorv32 system. This is what the f32c/fpgarduino RISC-V compiler emitted in the listing file:

   9              loop1:
  10 0010 63080000 beq zero,zero,loop1
  11 0014 6F000001 j loop1
  12 0018 6F000001 j loop1
  13             
  14              loop2:
  15 001c 630E0000 beq zero,zero,loop2
  16 0020 6F00C001 j loop2
  17 0024 6F000001 j loop1

What!? This can not be! Can it be that risc-v has NO relative jumps or branches !?
Look at the code generated, both jump and branch instruction addressing is absolute.
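The numbers in the listing can be checked by decoding the immediate fields by hand. The sketch below follows the B-type and J-type immediate layouts from the RISC-V specification and treats the listing bytes as little-endian memory order; the helper names are made up for this illustration. For every line of the listing the decoded immediate equals the address of the label, not the distance to it.

```python
def word_from_listing(hex_bytes):
    """The listing prints bytes in memory order; RISC-V is little-endian."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def sign_extend(value, bits):
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def b_type_offset(word):
    """Branch immediate: imm[12|10:5] in bits 31:25, imm[4:1|11] in bits 11:7."""
    imm = ((((word >> 31) & 0x1) << 12) | (((word >> 7) & 0x1) << 11)
           | (((word >> 25) & 0x3F) << 5) | (((word >> 8) & 0xF) << 1))
    return sign_extend(imm, 13)


def j_type_offset(word):
    """JAL immediate: imm[20|10:1|11|19:12] in bits 31:12."""
    imm = ((((word >> 31) & 0x1) << 20) | (((word >> 12) & 0xFF) << 12)
           | (((word >> 20) & 0x1) << 11) | (((word >> 21) & 0x3FF) << 1))
    return sign_extend(imm, 21)


# (address, bytes as printed, mnemonic) copied from the listing above
LISTING = [
    (0x0010, "63080000", "beq"),
    (0x0014, "6F000001", "j"),
    (0x0018, "6F000001", "j"),
    (0x001C, "630E0000", "beq"),
    (0x0020, "6F00C001", "j"),
    (0x0024, "6F000001", "j"),
]

for addr, code, kind in LISTING:
    word = word_from_listing(code)
    imm = b_type_offset(word) if kind == "beq" else j_type_offset(word)
    print(f"{addr:04x}: {kind:3s} imm={imm:#x}  pc+imm={addr + imm:#x}")
```

A core that treats the immediate as PC-relative, as the specification says, would jump to pc+imm and not to loop1 or loop2.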

This can not be! Looking at RISC-V specification. It clearly says that both branches and jumps use relative addressing. I am looking at the listing again.. absolute addresses? But the code generated by this compiler works in FPGA, I know it does. It really does. Well if executed by f32c implementation of RISC-V...

I just can't believe it, did the f32c developers really change the relative addressing to absolute addressing? And then patch GCC and provide binaries of this patched RISC-V toolchain for everybody to download? And wonder why nothing works?

I just can't describe how I felt when I realized that the f32c/fpgarduino developers indeed are using an absolute addressing compiler and a broken softcore. I was not happy at all.

Next try, more search.. and found, gnu-mcu-eclipse toolchain binaries for all host OS :) !
It takes some reading to find different commandline switches needed:

riscv-none-embed-gcc.exe -c -march=rv32i -mabi=ilp32 -Wa,-adhln -g start.S >start.lst

OK, the opcode for a relative branch to its own location (forever loop) is 63000000 and NOP is 13000000 in the byte order of the listing (the 32-bit words 0x00000063 and 0x00000013); this is all I wanted to know for starters.

Double click on the BRAM and then in COE file editor init values can be entered:

NOP
NOP
L1: beq zero,zero, L1
NOP
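Instead of typing the values into the COE editor, the init file can also be generated. A minimal sketch, assuming a 32 bit wide BRAM holding one instruction per word; the constant and function names are made up here. Note that the listing prints bytes in memory order, so the 32-bit words are 0x00000013 for NOP (addi x0, x0, 0) and 0x00000063 for the branch to itself.

```python
NOP = 0x00000013       # addi x0, x0, 0
BEQ_SELF = 0x00000063  # beq zero, zero, <own address> (offset 0)


def make_coe(words):
    """Return the text of a Xilinx COE file holding 32-bit hex words."""
    header = "memory_initialization_radix=16;\nmemory_initialization_vector=\n"
    body = ",\n".join(f"{word:08x}" for word in words)
    return header + body + ";\n"


# Same four instructions as the test program above
PROGRAM = [NOP, NOP, BEQ_SELF, NOP]

if __name__ == "__main__":
    print(make_coe(PROGRAM), end="")
```

Whether the words need byte swapping depends on how the BRAM controller is configured, so the first fetches should be checked in simulation or with the ILA.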

Now it is exciting, does picorv32 execute the branch correctly?


And it is working - the program counter goes from 0 to 4, then to 8 where the branch instruction is; the next location is also fetched, then execution continues again from location 8. The forever loop is working.

It is clearly visible in the trace that there is a pretty large latency fetching the instructions from BRAM over the AXI bus. Sure, this is far from optimal, but at least it works, and it is now instantly possible to create a Vivado design with picorv32 that can use any of the AXI IP cores available for Vivado.
For faster BRAM access it would possibly make sense to split the picorv32 bus address space between AXI and LMB and put the BRAM block on the LMB bus. This is also not complicated, but right now not of primary importance.

Friday, August 17, 2018

Listening to Clock Jitter


I have tried to listen to clock jitter before but never succeeded in that, well until now!

Test setup: 12MHz clock derived either from external MEMS oscillator or from Lattice XO2 FPGA on-chip PLL generating 12MHz output from 12MHz reference input. The clock is feeding 8 bit shift register that is repeatedly sending out 8 bit pattern. Shift register output is driving FPGA IO pins, and those are connected via audio transformer to 600 ohm differential input of Focusrite USB Audio.

There is a push-button that selects between 12MHz from MEMS and from internal PLL.

Idle pattern 0x69 (..01101001..): the low noise level is the MEMS oscillator; switching to the PLL clock adds significant noise that can take 2 discrete levels. This can only be if the "clock jitter" noise also repeats every 8 (or some power of 2) clocks.

Idle patterns 0x55 (..01010101..) and 0x33 (..00110011..) did not show any noise difference between MEMS and PLL clock sources.

Interesting, huh? Different silence patterns exhibit very different output: 0x69 is sensitive to the clock source, while 0x55 and 0x33 are not. What about 0x0F ?


Pattern 0x0F creates much higher level tone than 0x69, and the discrete levels are now more similar to each other compared to the levels of 0x69 pattern.

This tone comes from 0x0F; the frequency is about 3500 Hz.

So basically we have a signal chain like this

12 MHz MEMS clock -> PLL -> 12 MHz -> divide by 8 -> 3.5 kHz tone! If we remove the PLL from the chain the tone is gone.
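Note that the 8 bit pattern itself repeats at 12 MHz / 8 = 1.5 MHz, far above the audio band, so the 3.5 kHz tone cannot be the pattern itself. One way to compare the four idle patterns is to look at the harmonic content of one pattern period. The sketch below is only arithmetic on ideal bit patterns (function names are made up here); it says nothing about where the 3.5 kHz comes from.

```python
import cmath

CLOCK_HZ = 12_000_000
PATTERN_BITS = 8


def repetition_rate_hz():
    """How often the 8 bit idle pattern repeats."""
    return CLOCK_HZ / PATTERN_BITS


def harmonic_magnitudes(pattern):
    """|DFT| of one pattern period; bin k sits at k * 1.5 MHz.

    Bit order (MSB or LSB first) does not change the magnitudes.
    """
    bits = [(pattern >> (PATTERN_BITS - 1 - i)) & 1 for i in range(PATTERN_BITS)]
    n = len(bits)
    return [
        abs(sum(b * cmath.exp(-2j * cmath.pi * k * i / n) for i, b in enumerate(bits)))
        for k in range(n // 2 + 1)
    ]


if __name__ == "__main__":
    print("pattern repeats at", repetition_rate_hz(), "Hz")
    for pattern in (0x69, 0x55, 0x33, 0x0F):
        mags = harmonic_magnitudes(pattern)
        print(f"0x{pattern:02X}:", " ".join(f"{m:.2f}" for m in mags))
```

In this idealized model 0x69 and 0x0F have a component at the 1.5 MHz repetition rate (bin 1) while 0x55 and 0x33 do not, which is the same grouping as in the experiment; that may or may not be a coincidence.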

If the PLL makes such a clear tone out of 00001111 pattern, it should also have some impact on DSD audio as well?

Let's try, I start DSD audio data streaming from the PC to the same hardware setup (8 bit shift register..) and yes, the switch between MEMS clock and PLL clock clearly adds white noise when PLL clock is selected.

So confirmed - clock jitter does have impact on DSD audio quality. OK, yes the XO2 PLL clock must be indeed really bad in terms of "audio quality" but this experiment clearly shows that better clock gives better results.

I can now use this or similar setup to compare clock jitter from different sources.

Monday, August 6, 2018

Fast-Serial, Part II

Taking the FPGA code for the fast-serial "echo" to Vivado did take, well no time!



The "RTL" block includes exactly the same code I used previously on Gowin and Lattice boards. I even left the "LED counter" port there with connection to VIO (Virtual I/O).

This is how the setup looks: XO2000 + TE0723

The first attempt returned garbage instead of a correct echo. This is possibly due to timing issues, as I now used the maximum serial interface clock of 50MHz. In the test with Gowin/Lattice I had a clock of only 12MHz and tweaked the interface to work by using a negative edge flip-flop. Now the path from the FTDI to the "processing" FPGA (Zynq on TE0723) has an additional unconstrained delay introduced by the XO2 FPGA on the XO2000 board. So for the functional test I am reducing the clock to around 12MHz - and now I have a correct echo, same as with the same code on the Gowin and Lattice boards.

Time to write the actual IP Core now!

First I add AXI-Stream ports to the "top" VHDL file I used previously as design top and then I add this as module to newly created Vivado BD and it looks then like this:
This is nice, I did not expect that Vivado would auto detect the AXI-Stream Interface, but it did and correctly too! As next step I create VHDL wrapper to be my Testbench toplevel file and add 7 lines of VHDL there. Testbench is ready!
Now the real work begins, after about 22 lines of VHDL code I am ready with "receive only" IP Core, time to simulate, does it work as expected? Yes it does, back to real life. I add ILA (Integrated Logic Analyzer) to the BD I used before, so it looks now like this:
Now a real test, sending "Hello World!". Does the core work? Indeed it does.
This is an ILA snapshot of "12345678901234567890" sent with putty. There is one thing that the FT2232 datasheet does not tell you: it seems there are always 3 extra bits of IDLE between the bytes received.
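Those idle bits matter for throughput. A rough estimate, assuming a 10 bit frame per byte (start bit, 8 data bits, one channel select bit; this frame length is an assumption here, check it against the datasheet) plus the 3 idle bits seen in the ILA trace:

```python
def fast_serial_bytes_per_second(clock_hz, frame_bits=10, idle_bits=3):
    """Upper bound on received bytes per second.

    frame_bits=10 is an assumed frame length; idle_bits=3 is the gap
    observed between bytes in the ILA snapshot.
    """
    return clock_hz / (frame_bits + idle_bits)


if __name__ == "__main__":
    for clock_hz in (12_000_000, 50_000_000):
        rate = fast_serial_bytes_per_second(clock_hz)
        print(f"{clock_hz / 1e6:.0f} MHz clock: about {rate / 1e6:.2f} MB/s")
```

Under these assumptions the idle gap costs roughly a quarter of the raw rate compared with back-to-back 10 bit frames.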

Next step? Adding transmit portion, testing it and then wrapping it into Vivado Catalog IP Core...

Sunday, August 5, 2018

FTDI Fast-Serial Mode with FPGA

FTDI USB devices FT2232H (& Co) have one interesting but rather seldom used interface type called "high-speed opto-isolated interface". While the name implies it was initially designed for opto-isolation, there is no need to use the isolator; the interface can be used equally well without it. The benefit of the interface is that we only need 4 I/O pins in total for bidirectional data transfer (max clock rate 50MHz!).

I have been playing with the idea of using this interface in some projects for a while already. The first successful test was done within a few hours after I got a small, nifty board with an FT2232H and a small FPGA from the Chinese vendor Gowin Semiconductor.

LittleBee board from Trenz Electronic
Needless to say the board is designed by me too. Well to test the interface as quick as possible I tried the "uart echo" approach, I connected the FSDO data from FT2232 back to FSDI input of FT2232 (with some clocks delay implemented as shift register) and provided some free running clock on FT2232H clock pin FSCLK. I hoped this would work as "echo" and indeed after programming the FT2232 Channel B to "fast serial" mode using FT_PROG utility from FTDI it really worked, I got echo using putty as UART terminal program.

This is of course no actual real use yet, but it shows that the simple way of using standard COM port drivers to send high speed data really works. If we want to talk to the "application" in the FPGA we use standard serial port drivers be it on Linux or Windows host (Virtual COM port drivers are provided by FTDI).

Next Step - some real use ? For this I need to write some more complex FPGA code than the serial loop-back I used for initial testing. I could continue using the LittleBee board. But I would prefer to use my favorite environment Vivado for the IP Core development. I have FPGA boards that include Xilinx FPGA and have FTDI Channel B connected to the FPGA I/O Pins (as example TE0723 Arduino-Zynq board) but well there could be problems when trying to connect to channel B fast-serial and at the same time use channel B as Vivado JTAG (actually I tried it once and really did face such problems). Also this usage would need FT_PROG to be used on TE0723 an action that would destroy the Xilinx JTAG License ID in the FT2232 User EEPROM.

Solution? Let's use TE0723 for FPGA development and debug but with FT2232H connected externally to the PMOD on TE0723. This can be easily done, there is not even need for flying wires for this, I can create a FT2232 "custom breakout" from say a Lattice FPGA board XO2000
XO2000 board from Trenz Electronic
I do have several REV 1 PCBs that are "not for resale" - utilizing one of them as an FTDI fast-serial to PMOD adapter would be a good use. In order to "mate" properly I solder a 2 row male pin-header to the bottom of the PCB. Hmmm - I should probably cut the VCC pins in order not to short-circuit the 3.3V supplies of the two boards? Yes, it would have been easier to solder a 2x5 header instead of a 2x6 header in the first place.

Porting the code from Gowin to Lattice Diamond was done in less than 30 minutes and the "echo" test performed within an hour or so. So far so good.

Taking out the "echo" code and replacing it with "wiring only" code that directly connects signals from FTDI Channel B to PMOD pins takes another 30 minutes.

Done, now I can proceed with development with Vivado, the fast-serial interface comes from second FT2232 with no Xilinx JTAG license, so Vivado would not see it at all.

I guess the first step would be taking the echo loop-back code that is tested and working and implement it in Xilinx FPGA to verify the setup and pin-mapping is correct. Then it would be time to proceed with real IP Core development for the fast-serial mode.

(to be continued)

Monday, April 2, 2018

MEGA65 Computer Build with Vivado

Setup of HARDWARE bench

This is the easy part

Setup of the Build Environment

I assume this being simple as well, as there was Vivado tcl script to recreate the Vivado project.

Try #1

I am trying to start the tcl script from Vivado console. And yes I had forgotten that the src/vhdl directory does not contain all source code files at all. Yes, yes many of the files needed are generated files and not original VHDL files.

Try #2

Oh, the main toplevel Makefile - it really does not look like fun trying to build it on Windows/cygwin so I start linux in VM.
Ok, the main reason I assumed that the original Makefile is hard target for Windows was the presence of tool called Ophis.
But well, Ophis is a cross assembler that is written in Python and has built-in support for Windows. So Ophis would probably have run on my Windows box without problems, as I do have Python 3.6 installed and available from the console prompt. The lack of a proper make is another problem: the only make on the Windows path is the one from Delphi :) and that one is no good for building GNU Makefiles. I did set up a batch file to set the path to C:\Xilinx\SDK\2017\gnuwin\bin to execute some reasonable make, but after removing the "prg" targets the Makefile choked on a missing rule for the ".git" target. After some more trials I proceeded on the Linux VM.
Now the PRG targets are built, but what, CA65 is missing? Why, and where should I pull it in from? OK, let's see what else we can build.

make tools

After installing libpng-devel tools target is building without errors.

Try #3

Pulling in the CC65 source code, running make. The build succeeds, now let's copy ca65 and ld65 to where the mega Makefile can find them! After adding ./ into the Makefile, ca65/ld65 are found and executed. Now we are stuck on cbmconvert :(

Try #4

OK cbmconvert build also completed, this time I install it too so I do not need to add ./ into the Makefile. Now we are stuck on .git target missing. Well this I had on initial trial on Windows too.

Try #5

Searching for problem why .git is not found. Ah so stupid, I did copy the files into the Linux VM and did not fetch them directly from git, so the ".git" directory was missing. A dirty trick

git init

and we are in game again. Well now we are stuck on the next missing part - convert is missing.

Try #6

sudo yum install ImageMagick

and we are stuck again on? Seems something is wrong in the /core/src/mega65-fdisk folder. Well, but at least some of the VHDL files needed by Vivado have already been created, let's see if something is still missing. I am copying the VHDL files back from the Linux VM to the main workstation hard disk. There are 5 new VHDL files generated, but colourram.vhdl is still missing. I could try excluding it, but let's see if we can force the Linux build one step further.
Ok, the issue with fdisk is very simple: I did not download the git submodules, so I do that manually now and copy the files again into the Linux box.
Unfortunately I am stuck again, this time the error comes from C code, complaining that the -std=c99 option should be used. This is a well known issue with lots of Linux projects. For some reason adding the switch into the Makefile does not help, so I fix the source code instead, which is faster and easier. And now we are ready for Vivado!
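Most of the tries above were spent discovering one missing tool at a time. A small pre-flight check could list them all at once. This is only a sketch: the executable names in the list are assumptions based on the tools named above (for example that Ophis is on the path as "ophis"), and libraries such as libpng-devel cannot be found this way.

```python
import os
import shutil

# Assumed executable names for the tools the build stumbled over
REQUIRED_TOOLS = ["make", "git", "ophis", "ca65", "ld65", "cbmconvert", "convert"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_git_metadata(source_dir="."):
    """The Makefile needs a .git directory; a plain file copy loses it."""
    return os.path.isdir(os.path.join(source_dir, ".git"))


if __name__ == "__main__":
    for tool in missing_tools():
        print("missing:", tool)
    if not has_git_metadata():
        print("no .git directory here: clone with git (including submodules) instead of copying files")
```

Run it from the top of the source tree before starting make.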

Yeah!

Try #7




Now the fun can begin.
A mega65 fun!

Links

Pointers to resources

Ophis 

Cross assembler in Python under MIT license. The web pages of the original author are no longer online; his github account still seems to be active, with some commits in early 2017, but Ophis should be fetched from a forked repo.

CA65/LD65

Those utilities are needed to build MEGA65 Vivado files, they can be fetched from github and there is also Windows snapshot available here.

cbmconvert

From github here. Presumably a Windows build is available somewhere as well.





Sunday, March 25, 2018

Visit Estonia

Just some random hints for shorter or longer vacation to Estonia and other countries around Baltic sea.

When? (Weather)

July or August. That's it. There are usually some very nice days in May and September too but you are most likely to miss them if you visit only for a short time. June should also be OK. So one possible date plan is to include Jaanipäev Estonian summer Holiday. Please note that during the summer holiday the local folks (not only in Estonia) tend to use lots of alcohol!

Things to (NOT) worry about

Ticks. They can be found everywhere in the Baltic states. Two types exist, the red ones and the grey ones, both usually harmless. I used to be bitten by them multiple times each summer that I spent on the island of Hiiumaa. But should you get a tick, please, please do not try to pull it off yourself; see a doctor immediately. I recently got a Deer Tick on Bonny Doon Road close to Santa Cruz, California. It was a really bad thing. Such ticks do not live in Estonia to my knowledge, but still, please do not take any risk. You can read some more about ticks here.

Short trip with airplane

You can fly to and from Tallinn, Helsinki or Stockholm, whichever gives you the best flight connections. You should also take into account the access from the airport to the city center; this is possibly fastest and easiest in Tallinn. In Tallinn you can take a taxi (does not cost a fortune!) or a city bus, or even walk if you have light luggage. So assuming you fly to and from Tallinn (TLL), you can take the night ferry to Stockholm, spend a day there, then proceed with the night ferry to Helsinki. After a day in Helsinki you take the next night ferry to Tallinn. From Tallinn you can take your return flight back home.

This way you get 3 countries with one single round-trip airplane ticket and without the need to spend a night in a hotel. You can of course extend your stay in any of the visited countries. And you can also opt for the fast jet-boat on the Helsinki-Tallinn route.

This trip can be done with light airplane carry on luggage only. And without paying a dime to taxi or bus for airport transfers.

Considerations: Airport access to the city center is best in Tallinn. The maritime terminal in Helsinki is different for different vessels. So while the fast boat may land you very close to the city center, from the overnight ferry you may have to walk a few kilometers.

Short trip with a car

Let's say we start from Germany. During the drive through Poland you see where all the EU money goes:
Rabbit bridges!

Please keep some Polish money ready, you need to pay toll many, many times on the road. One place to stay on the way is Hotel Trylogia - it is located very close to the optimal route that bypasses Warsaw.

The next thing to note is the need to avoid crossing the border of the EU. This is possible, but you need to be very persistent against your GPS guidance - the route that stays within the EU borders is about 70 km longer than the one crossing the border. Chances are that the GPS will try to route you through White Russia for an hour or so. In Bialystok you must proceed to Augustow, this is very important. The next stop you take, say, in Resort Hotel Egles
You can fill your bottles with mineral Water:
Or have cup of coffee here:  
 Of course you can stay for a night or more there too. And in the lobby for the main eating room there are tourist agents lurking around. You may use them if you wish, they are able to get you to visa free DAY TRIP to White Russia! You just must return same day.
If lucky you can find this place in Druskininkai :) tasty! 
This is where Estonian national ARDF team did have a cake after Gold Medal from ARDF2017 Competition.

In Druskininkai you can rent a bike either in Egles or in the city, the selection is large:
If you like mummies you can take boat trip
To them:
Mummies are there... 
For Pokemon Go players: my Togetic is in the Gym about 800 meters from those mummies.

You can also visit the local Aqua Park they have a rather large Sauna-World too. And from that place there is a lift to the Snow Arena.

From Druskininkai it is a one day drive to Estonia. From Tallinn, Estonia you can proceed by ferry to Helsinki, then the next ferry to Stockholm, and then back to Germany either by the same route or down the roads of Sweden, your choice.

Places to go in Tallinn

If you want to buy some yarn or needlework stuff, the best place is KL24; it is a wholesale warehouse open to anyone. You may have trouble finding the entrance and getting in (you may have to press a call button named "Karnaluks"!) and when you enter you need to surrender all your bags. But then you can shop on several floors. On the way to KL24 you can drop in to Cafe "Poska" located at Laulupeo 1.
Please note the Google Earth photos from that street corner are really outdated!
After this meal you can proceed to KL24 (it is less than 500 meters away) and do some shopping!
If you have a car a very good place to eat is a place called NOY.
Above is my free meal. And yes there is such thing.

In Tallinn should you decide to go to beach Pirita you will be passing by a steel monument
My Father had a minor accident (broken arm) at the opening ceremony with an ultralight airplane. This monument also serves as a memorial to my Mother (a very successful parachute sportswoman), who is resting at the center of the Gulf of Tallinn. Please hold your breath for a second or two and have a smile.

On the marble foot of the monument there is an inscription in Estonian: "Julgetele ja teotahtelistele inimestele". My plea: if you think you have a good translation of this sentence into some other language, please drop me a short email. My own very bad translation to English is: 
"For the brave ones willing to act". 
It's a bad translation, I know, but I know no better.