Is it possible to reduce the space requirement of the binary operations tree on the FPGA by less than 2 times the bandwidth?

I have a circuit where in every clock cycle there are N 32-bit inputs to be computed. I have a binary operation that accepts two 32-bit inputs and gives one 32-bit output. This operation is associative, and I would like to apply it to all N-bit inputs of N to get one 32-bit output. I am currently achieving this by implementing a pipelined binary tree of operations.

For a specific example, suppose that N = 4 and I have inputs {a, b, c, d}, then I would do the following:

a op b => reg1 c op d => reg2 reg1 op reg2 => result 

When a stage in the tree is not divisible by 2, I insert a dummy operation that takes only 1 input, gives output 1 with the same delay.

The problem that I have is that I am interested in several sizes of N inputs {9, 25, 49, 81, 121}. The largest size N, 121, requires 110% luts in my FPGA fabric, while all other sizes fit easily. The tree of these binary operations is by far the largest consumer luts in my design.

I know that I could reduce the use of luts by almost half, halving the number of op circuits on my board and multiplexing their inputs. Unfortunately, this would mean getting the result only for each cycle and halving the bandwidth.

Since a full tree requires only 10% more resources than the board offers, a 50% reduction in bandwidth seems too significant due to the impact. Is there an architecture in which I can compromise with finer-grained downsizing to fine-grained bandwidth reduction?

+5
source share
2 answers

Have you considered using 3, 4, 5, or 6 input operators? LUTs in the current Altera and Xilinx FPGAs have 6 input signals. If your statements are really simple (i.e. AND, OR, XOR), you can spend a lot of LUTs in your current implementation using only 2 input functions instead of 6 input functions (provided that you register each stage of the pipeline). This will turn your binary tree into a k-dimensional tree of operations. With your numbers, many LUTs will be dropped depending on N.'s choice.

Here is a choice of K that would make sense:

  N | K | LUTs/bit | LUTs | LUTs/bit @k=2 | LUTs @k=2 -----|---|----------|------|---------------|----------- 9 | 3 | 4 | 128 | 11 | 352 25 | 5 | 6 | 192 | 25 | 800 49 | 5 | 13 | 416 | 50 | 1600 81 | 6 | 18 | 576 | 75 | 2400 121 | 6 | 26 | 832 | 123 | 3936 
+3
source

Ok, I took your example with 9 inputs and tried to solve a binary tree problem with 75% of binary operators.

I call the inputs a,b,c,d,e,f,g,h,i . To distinguish the input values ​​in time, I will add a check mark after a.

 a' = a at time 1 a''' = a at time 3 (ab)' = result of a' and b' 

9 inputs require 8 binary operations. I limit my processing scheme to 6 operators, so it uses 75% of the necessary operators. Each line in the following diagram represents one statement.

 time 1 time 2 time 3 time 4 | time 1+4 a'b' (abcd)'(efgh)' (ab)''(cd)'' e'''f''' | a'b' c'd' (abcdefgh)'i' (ef)''(gh)'' g'''h''' | c'd' e'f' a''b'' (abcd)''(efgh)'' (ab)'''(cd)''' | e'f' g'h' c''d'' (abcdefgh)''i'' (ef)'''(gh)''' | g'h' (ab)'(cd)' e''f'' a'''b''' (abcd)'''(efgh)''' | (ab)'(cd)' (ef)'(gh)' g''h'' c'''d''' (abcdefgh)'''i''' | (ef)'(gh)' 

After 4 cycles, the pattern repeats. Thus, processing 3 input sets requires 4 clock cycles:
=> 33% overhead.

This circuit can be sorted so that each line only processes 2 different inputs:
=> the input can be multiplexed by a 2: 1 multiplexer.

 time 1 time 2 time 3 time 4 | time 1+4 a'b' a''b'' a'''b''' (ab)'''(cd)''' | a'b' c'd' c''d'' c'''d''' (ef)'''(gh)''' | c'd' e'f' e''f'' (ab)''(cd)'' e'''f''' | e'f' g'h' g''h'' (ef)''(gh)'' g'''h''' | g'h' (ab)'(cd)' (abcd)'(efgh)' (abcd)''(efgh)'' (abcd)'''(efgh)''' | (ab)'(cd)' (ef)'(gh)' (abcdefgh)'i' (abcdefgh)''i'' (abcdefgh)'''i''' | (ef)'(gh)' 

If I made no mistakes, this should be a working network:

schema
(Interactive)

+2
source

All Articles