Understanding Fixed Point Arithmetic

Question

I am struggling with how to implement arithmetic on fixed-point numbers of different precisions. I have read the paper by R. Yates, but I'm still lost. In what follows I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fractional bits, and n + m + 1 bits overall.

The short question: how exactly are A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?

The long version: in my FFT algorithm, I generate a random signal with values between -10 V and 10 V. The signed inputs (s) are scaled to A(15,16), and the twiddle (tw) factors are scaled to A(2,29). Both are stored as ints. Something like this:

    float temp = (((float)rand() / (float)(RAND_MAX)) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
    in_seq[i][j] = (int)(roundf(temp * (1 << numFracBits)));

And similarly for twiddle factors.

Now I need to compute:

1) res = a*tw
Questions:
a) How do I implement this?
b) Should res be 64 bits wide?
c) Can I make res A(17,14), since I know the ranges of a and tw? If so, should I scale a*tw by 2^14 to keep the correct value in res?

2) a + res
Questions:
a) How do I add these two numbers in different Q formats?
b) If they cannot be added directly, how do I perform the operation?

c fixed-point

py sqr Jun 27 '16 at 13:56

2 answers

Your question seems to assume that there is a single right way to perform the operations you are interested in, yet you are explicitly asking about some of the details that determine how those operations should be performed. Perhaps this is the core of your confusion.

res = a*tw

a is represented as A(15,16) and tw as A(2,29), so the natural representation of their product is A(18,45). To keep full precision, the product needs as many value bits as the two factors combined. A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.

If you don't actually need or want 45 bits of fraction, you can round the product to A(18,13) (or to A(18+x, 13-x) for any non-negative x) without changing the magnitude of the result. This requires scaling. I would probably implement it like this:

    #include <stdint.h>

    /*
     * Computes a magnitude-preserving fixed-point product of any two signed
     * fixed-point numbers with a combined 31 (or fewer) value bits.  If x
     * is represented as A(s,t) and y is represented as A(u,v),
     * where s + t == u + v == 31, then the representation of the result is
     * A(s + u + 1, t + v - 32).
     */
    int32_t fixed_product(int32_t x, int32_t y) {
        const int64_t scale = (int64_t) 1 << 32;    /* drop 32 fractional bits */
        int64_t full_product = (int64_t) x * (int64_t) y;
        int64_t biased = full_product + scale / 2;  /* add half an output LSB to round */
        int64_t result = biased / scale;            /* truncates toward zero */

        if (biased % scale < 0) {
            result -= 1;                            /* turn truncation into floor */
        }
        return (int32_t) result;
    }

This avoids several potential problems and implementation-specific characteristics of signed integer arithmetic. It assumes that you want the results in a consistent format (that is, one that depends only on the input formats and not on the actual values), without overflow.
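As a quick sanity check of the format arithmetic, here is a minimal sketch that multiplies 2.5 in A(15,16) by 0.5 in A(2,29) and reads the result as A(18,13). The helper names `to_fixed`, `to_double`, and `mul_drop32` are hypothetical and not part of the answer; `mul_drop32` is a stand-in that drops 32 fractional bits with round-half-up.

```c
#include <stdint.h>

/* Hypothetical helper: encode a real value with frac_bits fractional bits. */
int32_t to_fixed(double value, int frac_bits) {
    double scaled = value * (double) ((int64_t) 1 << frac_bits);
    return (int32_t) (scaled + (scaled >= 0 ? 0.5 : -0.5));
}

/* Hypothetical helper: decode a raw fixed-point integer back to a double. */
double to_double(int32_t raw, int frac_bits) {
    return (double) raw / (double) ((int64_t) 1 << frac_bits);
}

/* Multiply two 32-bit fixed-point values and drop 32 fractional bits,
 * rounding half up.  Only division and remainder are applied to signed
 * values, so no implementation-defined shifts are involved. */
int32_t mul_drop32(int32_t x, int32_t y) {
    const int64_t scale = (int64_t) 1 << 32;
    int64_t biased = (int64_t) x * (int64_t) y + scale / 2;
    int64_t result = biased / scale;
    if (biased % scale < 0) {
        result -= 1;   /* convert truncation toward zero into floor */
    }
    return (int32_t) result;
}
```

With these helpers, `mul_drop32(to_fixed(2.5, 16), to_fixed(0.5, 29))` yields a raw value that `to_double(..., 13)` decodes as 1.25, which matches the A(18,13) interpretation.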

a + res

Addition is actually a bit more complicated if you cannot rely on the operands having the same scale to begin with. You need to rescale them so that the scales match before you can add. In general, you cannot do that without rounding away some precision.

In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice), which preserves the magnitude without losing any precision. If you want to represent the sum in 32 bits, then a format that carries no risk of changing the magnitude is A(19,11). That would look like this:

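    int32_t a_plus_res(int32_t a, int32_t res) {
        int64_t res16 = ((int64_t) res) * (1 << 3);  /* A(18,13) -> 16 fractional bits */
        int64_t sum16 = a + res16;                   /* A(19,16) held in 64 bits */
        int64_t biased = sum16 + (1 << 4);           /* add half an output LSB to round */
        int64_t result = biased / (1 << 5);          /* truncates toward zero */

        if (biased % (1 << 5) < 0) {
            result -= 1;                             /* turn truncation into floor */
        }
        return (int32_t) result;                     /* A(19,11) */
    }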

A general version would take the scales of the operand representations as additional arguments. That is possible, but the above is enough to chew on as it is.
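For illustration only, one way such a general version might look is sketched below. The function name `fixed_add` and its parameter layout are assumptions, not something from the answer, and the caller is assumed to pick `out_frac` so that the result fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical generalisation: add x (with x_frac fractional bits) and
 * y (with y_frac fractional bits), returning the sum with out_frac
 * fractional bits, rounded half up.  Assumes
 * 0 <= out_frac <= min(x_frac, y_frac) and max(x_frac, y_frac) <= 31.
 */
int32_t fixed_add(int32_t x, int x_frac, int32_t y, int y_frac, int out_frac) {
    int common = x_frac > y_frac ? x_frac : y_frac;
    assert(out_frac >= 0 && out_frac <= x_frac && out_frac <= y_frac);
    assert(common <= 31);

    /* Align both operands to the finer scale; multiplying avoids
     * left-shifting negative values. */
    int64_t xs = (int64_t) x * ((int64_t) 1 << (common - x_frac));
    int64_t ys = (int64_t) y * ((int64_t) 1 << (common - y_frac));
    int64_t sum = xs + ys;

    /* Drop (common - out_frac) fractional bits with round-half-up. */
    int64_t scale = (int64_t) 1 << (common - out_frac);
    int64_t biased = sum + scale / 2;
    int64_t result = biased / scale;     /* truncates toward zero */
    if (biased % scale < 0) {
        result -= 1;                     /* turn truncation into floor */
    }
    return (int32_t) result;
}
```

For example, `fixed_add(a, 16, res, 13, 11)` covers the A(15,16) plus A(18,13) case discussed above.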

All of the above assumes that the fixed-point format of each operand and result is constant. That is more or less the distinguishing feature of fixed point, differentiating it from floating-point formats on the one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary and tracking them with a separate variable per value. That would essentially be a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.

In addition, the above assumes that overflow must be avoided at all costs. It would also be possible to put operands and results on a consistent scale instead; this would simplify addition and multiplication, but it would introduce the possibility of arithmetic overflow. That may nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your specific data.
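A minimal sketch of the consistent-scale alternative, assuming every value is kept in A(15,16). The names `q16_add` and `q16_mul` are hypothetical, and reporting overflow through a boolean return value is just one possible policy.

```c
#include <stdbool.h>
#include <stdint.h>

#define Q16_FRAC_BITS 16   /* assumed common format: A(15,16) */

/* Add two A(15,16) values; returns false if the sum does not fit. */
bool q16_add(int32_t x, int32_t y, int32_t *out) {
    int64_t sum = (int64_t) x + (int64_t) y;
    if (sum > INT32_MAX || sum < INT32_MIN) {
        return false;
    }
    *out = (int32_t) sum;
    return true;
}

/* Multiply two A(15,16) values, truncating toward zero;
 * returns false if the rescaled product does not fit. */
bool q16_mul(int32_t x, int32_t y, int32_t *out) {
    int64_t product = (int64_t) x * (int64_t) y;   /* A(31,32) */
    int64_t scaled = product / ((int64_t) 1 << Q16_FRAC_BITS);
    if (scaled > INT32_MAX || scaled < INT32_MIN) {
        return false;
    }
    *out = (int32_t) scaled;
    return true;
}
```

Addition needs no rescaling here because both operands already share a scale; the price is that both operations can now fail for large inputs.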

John Bollinger Jun 27 '16 at 16:28

anatolyg · Accepted Answer · 2016-06-27T15:46:28+0000

Maybe the easiest way is to give an example.

Suppose you want to add two numbers, one in format A(3, 5) , and the other in format A(2, 10) .

You can do this by converting both numbers to a “common” format, that is, one in which they have the same number of bits in the fractional part.

The conservative way to do this is to choose the larger number of fractional bits. That is, convert the first number to A(3, 10) by shifting it 5 bits to the left. Then add the second number.

The result of an addition has the format of the larger range plus 1 integer bit. In my example, adding A(3, 10) and A(2, 10) gives a result in the format A(4, 10).
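The conservative addition above can be sketched as follows. The function name `add_a3_5_a2_10` is hypothetical, and the raw values are assumed to be stored in `int32_t`, which is far wider than the 15 bits A(4, 10) needs.

```c
#include <stdint.h>

/*
 * Conservative addition: x is A(3,5), y is A(2,10).  x is moved to 10
 * fractional bits, and the sum is returned as A(4,10).
 */
int32_t add_a3_5_a2_10(int32_t x, int32_t y) {
    /* A(3,5) -> A(3,10); multiplying instead of shifting keeps the
     * operation well defined for negative values. */
    int32_t x10 = x * (1 << 5);
    return x10 + y;   /* A(4,10) */
}
```

For instance, 1.5 in A(3, 5) is the raw value 48 and 0.25 in A(2, 10) is 256; their sum comes out as 1792, which is 1.75 with 10 fractional bits.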

I call this the “conservative” way because you cannot lose information: the result is guaranteed to be representable in the fixed-point format without loss of accuracy. In practice, however, you will want to use smaller formats for the results of your calculations. To do that, consider these ideas:

You can use the less accurate format as your common representation. In my example, you can convert the second number to A(2, 5) by shifting the integer right by 5 bits. This loses precision, but usually the loss is not a problem, because you are adding a less precise number to it anyway.
You can use 1 bit less for the integer part of the result. In applications, it often happens that the result cannot be too large. In that case, you can allocate 1 bit fewer to represent it. You might want to check whether the result is too large and clamp it to the required range.
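Both ideas can be combined in a short sketch. The names `clamp_to_bits` and `add_lossy_a3_5` are hypothetical; the second operand is reduced with floor rounding (what an arithmetic right shift would give), and the sum is clamped to A(3, 5) rather than being allowed to grow to A(4, 5).

```c
#include <stdint.h>

/* Clamp a value to the range of a signed field with value_bits value
 * bits (assumes 0 < value_bits < 31). */
int32_t clamp_to_bits(int32_t value, int value_bits) {
    int32_t max = (int32_t) ((1u << value_bits) - 1u);
    int32_t min = -max - 1;
    if (value > max) return max;
    if (value < min) return min;
    return value;
}

/*
 * Lossy addition: x is A(3,5), y is A(2,10).  y is reduced to 5
 * fractional bits, and the sum is clamped to A(3,5).
 */
int32_t add_lossy_a3_5(int32_t x, int32_t y) {
    int32_t y5 = y / (1 << 5);             /* truncates toward zero */
    if (y % (1 << 5) < 0) {
        y5 -= 1;                           /* floor, like an arithmetic shift */
    }
    return clamp_to_bits(x + y5, 3 + 5);   /* 8 value bits plus sign */
}
```

Whether clamping (saturation) or wrap-around is the right behavior depends on the application; clamping is only the option suggested in the text.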

Now, on to multiplication.

You can multiply two fixed-point numbers directly, whatever their formats. The result format is the “sum of the input formats” (add the integer parts together, add the fractional parts together), plus 1 extra bit in the integer part. In my example, multiplying A(3, 5) by A(2, 10) gives a number in the format A(6, 15). This is a conservative rule: the output format can hold the result without loss of accuracy, but in applications you almost always want to reduce the precision of the output, because it has too many bits.
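The format rule can be written down as a small sketch. The `fixed_format` type and the function names are hypothetical; they only encode the rule A(a,b) * A(c,d) -> A(a + c + 1, b + d) stated above.

```c
#include <stdint.h>

/* Hypothetical descriptor for a Yates-style A(int_bits, frac_bits) format. */
typedef struct {
    int int_bits;
    int frac_bits;
} fixed_format;

/* Conservative product format: A(a,b) * A(c,d) -> A(a + c + 1, b + d). */
fixed_format product_format(fixed_format x, fixed_format y) {
    fixed_format out = { x.int_bits + y.int_bits + 1, x.frac_bits + y.frac_bits };
    return out;
}

/* Total storage width in bits, including the sign bit. */
int format_width(fixed_format f) {
    return f.int_bits + f.frac_bits + 1;
}

/* Multiply raw A(3,5) and A(2,10) values; the raw result is A(6,15). */
int32_t mul_a3_5_a2_10(int32_t x, int32_t y) {
    return x * y;   /* needs at most 22 bits, so int32_t cannot overflow */
}
```

Applied to the formats in the question, A(15,16) times A(2,29) gives A(18,45), which is 64 bits wide and explains why the product has to be computed in a 64-bit type.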

In your case, where all numbers are 32 bits wide, you probably want to give up some precision so that all intermediate results also fit in 32 bits.

For example, multiplying A(17, 14) by A(2, 29) gives A(20, 43), which requires 64 bits. You should probably cut 32 bits out of it and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited to 2^19. (The conservative number 20 above is needed only to accommodate the edge case of multiplying -1 << 31 by -1 << 31; it is almost always worth ruling out this edge case.)

So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.

So instead of

 res = a*tw;

you probably want

    int64_t res_tmp = (int64_t)a * tw;       // A(20, 43)
    if (res_tmp == ((int64_t)1 << 62))       // you might want to neglect this edge case
        --res_tmp;                           // A(19, 43)
    int32_t res = (int32_t)(res_tmp >> 31);  // A(19, 12)
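The same steps can be wrapped in a function for testing, as in the sketch below. The name `mul_to_a19_12` is hypothetical. It replaces `>>` with a division plus a floor correction, on the assumption that portability matters: right-shifting a negative signed value is implementation-defined in C, although most compilers perform an arithmetic shift.

```c
#include <stdint.h>

/*
 * Multiply a in A(17,14) by tw in A(2,29) and return the product in
 * A(19,12), discarding 31 fractional bits (rounding toward negative
 * infinity, as an arithmetic right shift would).
 */
int32_t mul_to_a19_12(int32_t a, int32_t tw) {
    const int64_t scale = (int64_t) 1 << 31;
    int64_t full = (int64_t) a * tw;     /* A(20,43) */
    if (full == ((int64_t) 1 << 62)) {
        full -= 1;                       /* only INT32_MIN * INT32_MIN lands here */
    }
    int64_t result = full / scale;       /* truncates toward zero */
    if (full % scale < 0) {
        result -= 1;                     /* floor for negative products */
    }
    return (int32_t) result;             /* A(19,12) */
}
```

For example, 2.5 in A(17, 14) is the raw value 40960 and 0.5 in A(2, 29) is 268435456; the function returns 5120, which is 1.25 with 12 fractional bits.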
