Antti-Brain: GRVI de-mystified

This is just an exploration of the GRVI Phalanx microarchitecture, based on publicly available presentations and papers. Some claims from Gray Research about the GRVI microarchitecture:

Datapath: 250 LUT (LUT6 or total LUT?)
PE complete: 320 LUT at 375 MHz
PE + share of cluster: ~480 LUT
4000 LUT per cluster
3 pipeline stages
2-cycle loads, 3-cycle taken branches/jumps
Two pairs of operand multiplexers
ALU, PC unit and comparator use carry logic
RPM optimized to almost max?
Datapath diagram and floorplan:

GRVI Datapath RPM: 35 CLB (280 LUT6). We should assume that most (or all?) of the datapath must be constrained into this region. Red marks the 5 CLB for the register file; green is the ALU, where the carry chain blocks are clearly visible. Visible utilization:

Carry chains: one 32-bit (ALU, green) and two 24-bit (what for!?)
LUTRAM: 40 LUT (5 CLB) dual port register file (red)
Total used LUT6 (visible): 229
Total used FF (visible): 145

Due to the picture resolution the count may not be fully accurate. Also, only the LUT6 count is given, and many LUT6 are used as dual LUT5: there are at least 140 LUT6 used as dual LUT5. So if we count LUTs per output used (with each LUTRAM LUT still counted once), we get about 370 LUT for the datapath, and if we also count the LUTRAM as dual LUTs the count is about 410 LUT for the datapath alone, well above the advertised count of ~250 LUT. If we count only LUT6, then about 20 LUT are missing from the RPM plan view.
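The counting above can be reproduced with a few lines of arithmetic. This is only a sketch: the inputs are the numbers read off the floorplan picture, the variable names are made up for illustration, and treating the ~370 figure as "each LUTRAM LUT counted once" is an interpretation of the text, not something Gray Research states.

```python
# Counts read off the RPM floorplan picture (approximate, see text).
VISIBLE_LUT6 = 229         # LUT6 sites visibly used, LUTRAM included
DUAL_LUT5 = 140            # LUT6 sites used as two LUT5 ("at least")
LUTRAM = 40                # register file LUTs (5 CLB)
ADVERTISED_DATAPATH = 250  # datapath LUT count claimed by Gray Research

# One LUT per used output; LUTRAM sites counted once (assumption).
per_output = VISIBLE_LUT6 + DUAL_LUT5

# Same, but every LUTRAM site is also counted as a dual LUT.
per_output_lutram_dual = per_output + LUTRAM

# LUT6-only view: how far the picture falls short of the advertised count.
missing_lut6 = ADVERTISED_DATAPATH - VISIBLE_LUT6

print(per_output, per_output_lutram_dual, missing_lut6)
```

With these inputs the three values come out as 369, 409 and 21, which is where the rounded 370, 410 and "about 20" come from.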

GRVI Datapath from official presentation slides and documents.

GRVI Datapath with added details. Two pairs (2 x 2 = 4?) of operand multiplexers cannot be right, as there is really no need to multiplex both inputs to the compare unit. It is much more likely that the multiplexers are arranged as shown in the detailed datapath; those multiplexers and four 32-bit registers still fit into 8 CLB.

RPM block with comments about guessed function block locations: something must be wrong. If we assume that the datapath from the diagram is fitted into the RPM map and that the visible resource utilization is correct, there are still some functions that cannot be located. The selector for the immediate operand needs at least 19 LUT6 (2.5 CLB), and this is just not there. The result multiplexer cannot be 5:1. There is a trick to avoid that, but for it to work the ALU (or the DIN multiplexer in the shared area) must be able to return zero as its result; that would allow the 5:1 multiplexer to be used only for the lowest ALU bit.
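To see why an immediate operand selector costs LUTs at all, here is a small software model of immediate decoding. It assumes GRVI implements the standard RISC-V RV32I instruction formats (the post itself does not spell this out), and the function names are hypothetical. It says nothing about how Gray Research actually maps the selector to LUTs; it only shows that most immediate bits have to be picked from different instruction bits depending on the format.

```python
def bits(inst, hi, lo):
    """Return instruction bits hi..lo (inclusive) as an unsigned integer."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def decode_imm(inst, fmt):
    """Decode the immediate of a 32-bit RV32I instruction word.

    fmt is one of "I", "S", "B", "U", "J" (standard RISC-V formats).
    """
    if fmt == "I":
        return sign_extend(bits(inst, 31, 20), 12)
    if fmt == "S":
        return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)
    if fmt == "B":
        raw = ((bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11)
               | (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1))
        return sign_extend(raw, 13)
    if fmt == "U":
        return sign_extend(bits(inst, 31, 12) << 12, 32)
    if fmt == "J":
        raw = ((bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12)
               | (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1))
        return sign_extend(raw, 21)
    raise ValueError("unknown format: " + fmt)

# addi x1, x0, -1  and  lui x5, 0x12345
print(decode_imm(0xFFF00093, "I"), hex(decode_imm(0x123452B7, "U")))
```

In hardware each of these branches becomes one input of a per-bit multiplexer, which is the logic that should show up somewhere in the RPM.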

List (not complete) of minimal resources needed for GRVI PE (assuming 4K IMEM):

40 LUT6: Register file (5 CLB, minimal 37 LUT6?)
32 LUT6 + Carry: ALU (4 CLB)
32 LUT6 + ??: Compare unit
19 LUT6: immediate operand generator (2.5 CLB)
5 LUT6 + 10 FF: next_pc + if_pc (1 CLB)
10 LUT6 + 10 FF: pc_incr + dc_pc (1.5 CLB)
32 LUT6: result mux (4 CLB)
5 FF: register file Rd address latch
1 LUT6 + 1 FF: register file write decoder and latch
? LUT: operand mux decoder (DC stage)
? LUT + ? FF: execute stage decoder/latch
? LUT + 2 FF: pipeline stall logic
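Adding up the known entries of the list gives a lower bound that can be compared with the RPM figures. A minimal sketch: the table simply transcribes the list above, with `None` for anything the list marks as unknown or does not state, and the names `RESOURCES` and `known_total` are made up for this example.

```python
# (name, LUT6, FF, CLB); None = unknown or not stated in the list.
RESOURCES = [
    ("register file",               40,   None, 5.0),
    ("ALU",                         32,   None, 4.0),
    ("compare unit",                32,   None, None),
    ("immediate operand generator", 19,   None, 2.5),
    ("next_pc + if_pc",             5,    10,   1.0),
    ("pc_incr + dc_pc",             10,   10,   1.5),
    ("result mux",                  32,   None, 4.0),
    ("Rd address latch",            None, 5,    None),
    ("write decoder and latch",     1,    1,    None),
    ("operand mux decoder",         None, None, None),
    ("execute stage decoder/latch", None, None, None),
    ("pipeline stall logic",        None, 2,    None),
]

def known_total(column):
    """Sum one column (1 = LUT6, 2 = FF, 3 = CLB), skipping unknown entries."""
    return sum(row[column] for row in RESOURCES if row[column] is not None)

known_lut6 = known_total(1)
known_ff = known_total(2)
known_clb = known_total(3)

print(known_lut6, known_ff, known_clb)
```

The known entries add up to 171 LUT6, 28 FF and 18 CLB. The items marked "?" are not included, so the real totals can only be higher.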

Commentary based on public info and images from Gray Research:

The instruction latch for the DC stage is implemented using the BRAM primitive's register
There is absolutely no reasonable explanation for two 24-bit-long carry chains!
If the PC unit really uses a carry chain, then the visual RPM map is incorrect
If the compare unit uses a carry chain, then the visual RPM map is incorrect
Resources for the immediate operand selector cannot be located; outside the visible window?
Store operations may take more than 1 cycle if arbitration is lost
Load operations may take more than 2 cycles if arbitration is lost
The ALU does not use DSP (some third parties say it does)

Open question: why is there a datapath from the result multiplexer back to the ALU? The only use for this would be if shift operations are implemented as loops.
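What "shifts as loops" would mean can be shown with a tiny model. This is purely hypothetical and not taken from any Gray Research material: it assumes a left shift is done one bit per cycle by feeding the result back into the adder (x + x), so the number of cycles equals the shift amount. The function name is made up.

```python
MASK32 = 0xFFFFFFFF

def shift_left_iterative(value, shamt):
    """Model a looped left shift: one pass through the adder per bit.

    Returns (result, cycles). Hypothetical model, for illustration only.
    """
    result = value & MASK32
    cycles = 0
    for _ in range(shamt & 31):             # RV32 uses the low 5 bits of shamt
        result = (result + result) & MASK32  # result fed back to the ALU adder
        cycles += 1
    return result, cycles

print(shift_left_iterative(1, 4))
```

A design like this trades a 32-bit barrel shifter for extra cycles, and it would need exactly the kind of feedback path from the result multiplexer to the ALU input that the diagram shows.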


Monday, December 24, 2018

GRVI de-mystified - Part I
