Your question seems to assume that there is a single right way to perform the operations you are interested in, yet the details you are asking about are exactly the ones that determine how the operations should be performed. Perhaps this is the core of your confusion.
a is represented as A(15,16) and tw is represented as A(2,29), so the natural representation of their product is A(18,45). To preserve full precision, the product needs more value bits (as many as the two factors have combined). A(18,45) is how you should interpret the result of widening your ints to a 64-bit signed integer type (e.g. int64_t) and computing their product.
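As a rough illustration of the widening step, the sketch below computes the full 64-bit product and converts it to a double so that the A(18,45) interpretation can be checked by eye. The names wide_product and a18_45_to_double are made up for this example, and the conversion is only meant for debugging.

```c
#include <stdint.h>

/* Hypothetical helper: multiply an A(15,16) value by an A(2,29) value.
 * Both operands are widened first, so the 64-bit result is exact and
 * carries 16 + 29 = 45 fraction bits, i.e. A(18,45). */
int64_t wide_product(int32_t a, int32_t tw) {
    return (int64_t) a * (int64_t) tw;
}

/* Hypothetical helper: interpret an A(18,45) value as a double.
 * Dividing by 2^45 is exact, but converting a large int64_t to double
 * may round, so use this for inspection only. */
double a18_45_to_double(int64_t p) {
    return (double) p / (double) (INT64_C(1) << 45);
}
```

For example, 1.5 in A(15,16) is 98304 and 0.25 in A(2,29) is 134217728; their wide product reads back as 0.375.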
If you don't actually need or want 45 bits of fraction, you can round the result to A(18,13) (or to A(18+x, 13-x) for any non-negative x) without changing its magnitude. This requires scaling. I would probably implement it as follows:
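    #include <stdint.h>

    int32_t fixed_product(int32_t x, int32_t y) {
        /* A(15,16) * A(2,29) -> A(18,45), exact in 64 bits */
        int64_t full_product = (int64_t) x * (int64_t) y;
        /* drop 32 fraction bits to reach A(18,13); division avoids
           right-shifting a negative value */
        int64_t truncated = full_product / (INT64_C(1) << 32);
        if (full_product % (INT64_C(1) << 32) < 0) truncated--;  /* floor */
        /* bit 31 is the most significant dropped bit */
        int round_up = (int) (((uint64_t) full_product >> 31) & 1);
        return (int32_t) (truncated + round_up);
    }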
This avoids several potential problems with, and implementation-defined characteristics of, signed integer arithmetic. It assumes that you want the results to be in a consistent format (that is, one that depends only on the formats of the inputs, not on their actual values), without overflow.
- a + res
Addition is actually a bit more complicated if you cannot rely on the operands to have the same scale to begin with. You need to rescale them to match before you can add. In general, you cannot do that without losing some precision to rounding.
In your case, you start with one A(15,16) and one A(18,13). You can compute an intermediate result in A(19,16) or wider (presumably A(47,16) in practice), which preserves the value without loss of precision, but to represent the result in 32 bits you have to give up some fraction bits to make room for the wider integer part. A(19,11) does that without risk of changing the magnitude. It would look like this:
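    #include <stdint.h>

    int32_t a_plus_res(int32_t a, int32_t res) {
        /* rescale res from 13 to 16 fraction bits */
        int64_t res16 = ((int64_t) res) * (1 << 3);
        /* A(19,16) sum, held in 64 bits */
        int64_t sum16 = a + res16;
        /* drop 5 fraction bits to reach A(19,11) */
        int64_t truncated = sum16 / (1 << 5);
        if (sum16 % (1 << 5) < 0) truncated--;  /* floor */
        /* bit 4 is the most significant dropped bit */
        int round_up = (int) (((uint64_t) sum16 >> 4) & 1);
        return (int32_t) (truncated + round_up);
    }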
A general version would take the scales of the operand representations as additional arguments. That is doable, but what is here is enough to chew on as it is.
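For what it is worth, a minimal sketch of such a general version might look like the following. The names fixed_add and rescale_round and the parameter layout are inventions for this example. It assumes every fraction-bit count is between 0 and 31 and that out_frac is no larger than either operand's count; it returns int64_t so the caller can decide whether the result fits the target width.

```c
#include <stdint.h>

/* Hypothetical helper: reduce v from `from` to `to` fraction bits
 * (assumes to <= from), rounding halves toward +infinity. */
static int64_t rescale_round(int64_t v, int from, int to) {
    int drop = from - to;
    if (drop == 0) return v;
    int64_t unit = INT64_C(1) << drop;
    int64_t q = v / unit;
    if (v % unit < 0) q--;                              /* floor */
    q += (int64_t) (((uint64_t) v >> (drop - 1)) & 1);  /* round bit */
    return q;
}

/* Hypothetical general addition: x has x_frac fraction bits, y has
 * y_frac, and the result has out_frac. Assumes 0 <= frac <= 31 and
 * out_frac <= max(x_frac, y_frac), so nothing overflows 64 bits. */
int64_t fixed_add(int32_t x, int x_frac, int32_t y, int y_frac, int out_frac) {
    int common = x_frac > y_frac ? x_frac : y_frac;
    /* bring both operands to the finer of the two scales */
    int64_t xs = (int64_t) x * (INT64_C(1) << (common - x_frac));
    int64_t ys = (int64_t) y * (INT64_C(1) << (common - y_frac));
    return rescale_round(xs + ys, common, out_frac);
}
```

With these assumptions, fixed_add(a, 16, res, 13, 11) covers the A(15,16) plus A(18,13) case discussed above.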
All of the above assumes that the fixed-point format of each operand and result is constant. That is more or less the distinguishing feature of fixed point, setting it apart from floating-point formats on the one hand and from arbitrary-precision formats on the other. You do, however, have the alternative of allowing formats to vary and tracking them with a separate variable per value. That would basically be a hybrid of fixed-point and arbitrary-precision formats, and it would be messier.
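To make the hybrid idea concrete, here is a small sketch in which each value carries its own fraction-bit count. The type tracked_fixed and both functions are hypothetical, and the sketch deliberately does no overflow checking: it assumes the caller keeps raw values small enough that the 64-bit arithmetic cannot overflow.

```c
#include <stdint.h>

/* Hypothetical value type: a raw integer plus the number of fraction
 * bits needed to interpret it. */
typedef struct {
    int64_t raw;
    int frac_bits;
} tracked_fixed;

/* The product's scale is the sum of the operand scales.
 * Assumes the raw product fits in 64 bits. */
tracked_fixed tracked_mul(tracked_fixed x, tracked_fixed y) {
    tracked_fixed r;
    r.raw = x.raw * y.raw;
    r.frac_bits = x.frac_bits + y.frac_bits;
    return r;
}

/* The sum takes the finer of the two scales, so no rounding is needed.
 * Assumes the rescaled operands and their sum fit in 64 bits. */
tracked_fixed tracked_add(tracked_fixed x, tracked_fixed y) {
    tracked_fixed r;
    r.frac_bits = x.frac_bits > y.frac_bits ? x.frac_bits : y.frac_bits;
    r.raw = x.raw * (INT64_C(1) << (r.frac_bits - x.frac_bits))
          + y.raw * (INT64_C(1) << (r.frac_bits - y.frac_bits));
    return r;
}
```

The cost shows up quickly: fraction bits accumulate with every multiplication, so something still has to decide when to round them away.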
In addition, the above assumes that overflow must be avoided at all costs. It would also be possible to put operands and results on a consistent scale instead; this would simplify addition and multiplication, but it would introduce the possibility of arithmetic overflow. That may nevertheless be acceptable if you have reason to believe that such overflow is unlikely for your specific data.
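A sketch of that consistent-scale alternative, assuming purely for illustration that every value is A(15,16): addition needs no rescaling, multiplication always drops the same 16 bits, and the caller checks whether the result still fits. The names q16_add, q16_mul and fits_int32 are hypothetical.

```c
#include <stdint.h>

/* Assumed common format for this sketch: every value is A(15,16). */
#define FRAC_BITS 16

/* Same scale on both sides, so the raw values add directly.
 * Returned wide so the caller can detect overflow. */
int64_t q16_add(int32_t x, int32_t y) {
    return (int64_t) x + (int64_t) y;
}

/* The wide product has 32 fraction bits; round 16 of them away,
 * with halves going toward +infinity. */
int64_t q16_mul(int32_t x, int32_t y) {
    int64_t wide = (int64_t) x * (int64_t) y;
    int64_t unit = INT64_C(1) << FRAC_BITS;
    int64_t q = wide / unit;
    if (wide % unit < 0) q--;                                   /* floor */
    q += (int64_t) (((uint64_t) wide >> (FRAC_BITS - 1)) & 1);  /* round bit */
    return q;
}

/* Nonzero if v can be stored back into a 32-bit A(15,16) value. */
int fits_int32(int64_t v) {
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

Here 1.5 times 2.0 (98304 and 131072) gives 196608, which is 3.0, while 256.0 times 256.0 no longer fits, which is exactly the overflow risk this approach accepts.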