Antti-Brain: December 2018

This is just an exploration of the GRVI Phalanx microarchitecture, based on publicly available presentations and papers. Some claims from Gray Research about the GRVI microarchitecture:

Datapath: 250 LUT (LUT6 or total LUT?)
PE complete: 320 LUT at 375 MHz
PE + share of cluster: ~480 LUT
4000 LUT per cluster
3 pipeline stages
2 cycle load/3 cycle taken branches/jumps
Two pairs of operand multiplexers
ALU, PC unit and comparator use carry logic
RPM optimized to almost max?
Datapath diagram and floorplan:

GRVI Datapath RPM: 35 CLB (280 LUT6). We should assume that most (or all?) of the datapath is constrained into this region. Red marks the 5 CLB of the register file; green is the ALU, where the carry chain blocks are clearly visible. Visible utilization:

Carry chains: one 32 bit (ALU, green) and two 24 bit (what for!?)
LUTRAM: 40 LUT (5 CLB) dual port register file (red)
Total used LUT6 (visible): 229
Total used FF (visible): 145

Due to the picture resolution the counts may not be fully accurate. Also, only the LUT6 count is given, but many LUT6 are used as dual LUT5: at least 140 of them. If we count LUTs per used output (with LUTRAM not counted as dual), we get about 370 LUT for the datapath, and if we also count LUTRAM as dual LUT, the count is about 410 LUT for the datapath alone, well above the advertised ~250 LUT. If we count only LUT6, then about 20 LUT are missing from the RPM plan view.
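The arithmetic behind these figures can be checked with a few lines of Python. The inputs are the visible counts listed above; the variable names are mine, and the dual-LUT5 figure is only the lower bound estimated from the picture.

```python
# Visible counts read off the RPM floorplan picture (approximate).
visible_lut6 = 229   # all occupied LUT6 sites, including LUTRAM
dual_lut5 = 140      # lower bound: LUT6 sites used as two LUT5
lutram = 40          # register file LUTRAM sites (5 CLB)
advertised = 250     # datapath size claimed by Gray Research

# One LUT per used output; LUTRAM sites are counted once here.
per_output = visible_lut6 + dual_lut5
# Additionally count every LUTRAM site as a dual LUT.
per_output_lutram_dual = per_output + lutram
# LUT6 sites implied by the advertised figure but not visible in the picture.
missing_lut6 = advertised - visible_lut6

print(per_output, per_output_lutram_dual, missing_lut6)
```

This gives 369, 409 and 21, which round to the 370, 410 and 20 quoted above.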

GRVI Datapath from official presentation slides and documents.

GRVI Datapath with added details. Two pairs (2 x 2 = 4?) of operand multiplexers cannot be right, as there is no real need to multiplex both inputs to the compare unit. It is much more likely that the multiplexers are arranged as shown in the detailed datapath; those multiplexers and four 32 bit registers still fit into 8 CLB.

RPM block with comments on the guessed function block locations. Something must be wrong. If we assume that the datapath from the diagram is fitted into the RPM map, and that the visible resource utilization is correct, there are still some functions that cannot be located: the selector for the immediate operand needs at least 19 LUT6 (2.5 CLB), and it is just not there. The result multiplexer cannot be 5:1. There is a trick to avoid that, but for it to work the ALU (or the DIN multiplexer in the shared area) must be able to return zero as its result; that would allow the 5:1 multiplexer to be used only for the lowest ALU bit.

List (not complete) of minimal resources needed for GRVI PE (assuming 4K IMEM):

40 LUT6: Register file (5 CLB, minimal 37 LUT6?)
32 LUT6 + Carry: ALU (4 CLB)
32 LUT6 + ??: Compare unit
19 LUT6: immediate operand generator (2.5 CLB)
5 LUT6 + 10 FF: next_pc + if_pc (1 CLB)
10 LUT6 + 10 FF: pc_incr + dc_pc (1.5 CLB)
32 LUT6: result mux (4 CLB)
5 FF: register file Rd address latch
1 LUT6 + 1 FF: register file write decoder and latch
? LUT: operand mux decoder (DC stage)
? LUT + ? FF: execute stage decoder/latch
? LUT + 2 FF: pipeline stall logic

Commentary based on public info and images from Gray Research:

Instruction latch for the DC stage is implemented with the BRAM primitive's register
There is absolutely no reasonable explanation for two 24 bit long carry chains!
If PC unit really uses carry chain then the visual RPM map is incorrect
If compare unit uses carry chain then the visual RPM map is incorrect
Resources for immediate operand selector can not be located, outside visible window?
Store operations may take more than 1 cycle if arbitration is lost
Load operations may take more than 2 cycles if arbitration is lost
ALU does not use DSP (there are some 3rd parties saying it does)

Open question: Why is there a datapath from the result multiplexer back to the ALU? The only use for this would be if shift operations are implemented as loops.
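To make the "shifts as loops" idea concrete, here is a purely hypothetical sketch, not something confirmed for GRVI. It assumes a left shift is done one bit per cycle by feeding the previous result back into the adder (x + x equals x << 1). The function names are mine.

```python
MASK32 = 0xFFFFFFFF

def alu_add(a, b):
    """32-bit adder, the only ALU operation this sketch relies on."""
    return (a + b) & MASK32

def shift_left_looped(value, amount):
    """Shift left by feeding the result back into the ALU once per bit."""
    result = value & MASK32
    cycles = 0
    for _ in range(amount & 31):          # RV32I uses the low 5 bits of the amount
        result = alu_add(result, result)  # x + x == x << 1
        cycles += 1
    return result, cycles

print(shift_left_looped(1, 4))
```

In such a scheme a shift by n costs n passes through the ALU, which is exactly where a feedback path from the result multiplexer would be needed.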

One might think that FPGA vendors know how to generate RAM from their own primitives efficiently. I did, at least, until I tried it out for a small RISC-V implementation.

There are two memory blocks made with Vivado IP Integrator. For some reason this small RISC-V SoC showed a resource utilization of over 200 LUT, while I know it should be smaller than that. The detailed report showed 24 LUT and 3 flip-flops consumed in the 32 KByte, 8 bit wide RAM. How can this be? A 32Kx8 bit memory should use 8 BRAM primitives and 0 LUT. Checking the RTL view after synthesis:

OK, this explains part of the problem: the BRAMs are configured as 8 bit wide, with an 8 to 1 multiplexer at the output. This generates some LUT, but where did those 3 flip-flops come from? Looking again at the post-implementation RTL:

Right, they are needed: the address must be delayed by one clock for the output multiplexer to work properly, so those 3 flip-flops really are required.
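A small behavioural model shows why the select bits have to be registered. It is a sketch under my own assumptions: 8 blocks of 4Kx8 stacked in depth, each with a synchronous read, and hypothetical class and method names.

```python
class DepthCascadedRam:
    """Behavioural model: `banks` synchronous-read RAM blocks stacked in depth."""

    def __init__(self, banks=8, bank_depth=4096):
        self.bank_depth = bank_depth
        self.mem = [[0] * bank_depth for _ in range(banks)]
        self.bank_out = [0] * banks  # output latch of every block
        self.sel_q = 0               # registered upper address bits (the 3 FFs)

    def write(self, addr, data):
        self.mem[addr // self.bank_depth][addr % self.bank_depth] = data & 0xFF

    def clock(self, addr):
        """One clock edge: every block latches its word, the select is registered."""
        offset = addr % self.bank_depth
        self.bank_out = [bank[offset] for bank in self.mem]
        self.sel_q = addr // self.bank_depth

    def dout(self):
        """Output mux, steered by the delayed select, not by the live address."""
        return self.bank_out[self.sel_q]

ram = DepthCascadedRam()
ram.write(5000, 0xAB)  # bank 1, offset 904
ram.write(904, 0x11)   # bank 0, same offset
ram.clock(5000)        # read request for address 5000
live_addr = 0          # the address bus has already moved on
print(hex(ram.dout()))                                 # registered select
print(hex(ram.bank_out[live_addr // ram.bank_depth]))  # live address as select
```

The first value is the requested byte. The second, taken with the live address as select, comes from the wrong block, which is the failure the 3 flip-flops prevent.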

So while the complete RISC-V soft CPU takes 59 Slices, the "extra added overhead" from the Xilinx RAM generator takes 11 Slices! Checking the configuration options:

So where is 32kx1? This would be the one to choose when making a 32K deep memory, but this option is simply missing. Let's try what happens if we select 16kx1. OK, this is looking better: this time the RAM synthesizer is using 32kx1. What I selected was 16kx1, so the generator must have read my mind and done what I wanted.

Both memory blocks now use only BRAM and no logic resources.

Looks nice and works too!
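The difference between the two configurations can be sketched by counting what each layout needs. The function below is my own illustration, not the generator's algorithm; the 4Kx8 aspect ratio for the first case is inferred from the 8 bit wide blocks and the 8 to 1 multiplexer seen above.

```python
import math

def ram_organisation(depth, width, prim_depth, prim_width):
    """Count primitives, output mux inputs and select flip-flops for one layout."""
    rows = math.ceil(depth / prim_depth)  # blocks stacked in depth
    cols = math.ceil(width / prim_width)  # blocks side by side in width
    select_bits = math.ceil(math.log2(rows)) if rows > 1 else 0
    return {
        "primitives": rows * cols,
        "mux_inputs": rows if rows > 1 else 0,
        "select_ffs": select_bits,
    }

byte_wide = ram_organisation(32 * 1024, 8, 4 * 1024, 8)    # 8 blocks of 4Kx8
bit_sliced = ram_organisation(32 * 1024, 8, 32 * 1024, 1)  # 8 blocks of 32Kx1

print(byte_wide)
print(bit_sliced)
```

Both layouts use 8 primitives, but only the byte-wide one needs an 8 to 1 output multiplexer and 3 select flip-flops; the bit-sliced one needs neither.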

Monday, December 24, 2018

GRVI de-mystified - Part I

Thursday, December 13, 2018

Xilinx BRAM generator, some tricks