How to subtract IEEE 754 numbers?

Question

How to subtract IEEE 754 numbers?

For example: 0.546875 - 32.875 ...

0.546875 is 0 01111110 10001100000000000000000 in IEEE-754

-32.875 is 1 10000111 01000101111000000000000 in IEEE-754

So how do I do the subtraction? I know that I need to make both exponents equal, but what should I do after that? Take the 2's complement of the -32.875 mantissa and add it to the 0.546875 mantissa?


math floating-point ieee-754

Tiago costa Jan 7 '12 at 0:29


1 answer

old_timer · Accepted Answer · 2012-01-07T04:19:13+0000

It is really no different from how you do it with pencil and paper. OK, a little different.

 123400 - 5432 = 1.234*10^5 - 5.432*10^3

The larger number dominates. Shift the smaller number's mantissa to the right, dropping digits into the bit bucket, until the exponents match

 1.234*10^5 - 0.05432*10^5

then do the subtraction with the mantissas

 1.234 - 0.05432 = 1.17968
 1.17968 * 10^5

Then normalize (in this case it already is).

That was with base 10 numbers.

In IEEE float, single precision

 123400 = 0x1E208 = 0b11110001000001000
 11110001000001000.000...

To normalize that, we have to shift the binary point 16 places to the left, so

 1.1110001000001000 * 2^16

The exponent is biased, so we add 127 to 16 and get 143 = 0x8F. It is a positive number, so the sign bit is 0. Now we start building the IEEE floating point number. The leading 1 before the binary point is implied and not stored in single precision, so we drop it and keep only the fraction

sign bit, exponent, mantissa

 0 10001111 1110001000001000...
 0100011111110001000001000...
 0100 0111 1111 0001 0000 0100 0...
 0x47F10400

And if you write a program to see what the computer thinks 123400 is, you get the same thing:

 0x47F10400 123400.000000
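The hand conversion above can be sketched in a few lines of C. This is only an illustration, not a general converter: the function name `encode_small_uint` is made up for this example, and it assumes the integer is positive and below 2^24 so that no rounding is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode an integer 0 < n < 2^24 as IEEE 754 single
   precision bits. Values in this range fit in the 24-bit significand, so
   no rounding is needed. */
uint32_t encode_small_uint(uint32_t n)
{
    assert(n > 0 && n < (1u << 24));

    /* The position of the most significant 1 is the unbiased exponent. */
    int exponent = 0;
    while ((n >> exponent) > 1u)
        exponent++;

    /* Left-justify so the leading 1 sits in bit 23, then drop it (implied). */
    uint32_t fraction = (n << (23 - exponent)) & 0x7FFFFFu;

    /* Sign 0, biased exponent in bits 30..23, fraction in bits 22..0. */
    return ((uint32_t)(exponent + 127) << 23) | fraction;
}
```

For 123400 the loop stops at an exponent of 16, which is biased to 143 = 0x8F, matching the steps worked out by hand.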

So, we know the exponent and mantissa for the first operand

Now the second operand

 5432 = 0x1538 = 0b0001010100111000

Normalize: shift the binary point 12 places to the left

 1010100111000.000
 1.010100111000000 * 2^12

The exponent is biased: add 127 to 12 and get 139 = 0x8B = 0b10001011

Put it all together

 0 10001011 010100111000000
 010001011010100111000000
 0100 0101 1010 1001 1100 0000...
 0x45A9C000

And the computer program / compiler gives the same

 0x45A9C000 5432.000000

Now to answer your question. Here are the component parts of the two floating point numbers; I have restored the implied 1 because we need it

 0 10001111 111100010000010000000000 - 0 10001011 101010011100000000000000

We need to line up our binary points, just like in grade school, before we can subtract. In this context that means shifting the number with the smaller exponent to the right, tossing mantissa bits off the end, until the exponents match

 0 10001111 111100010000010000000000 - 0 10001011 101010011100000000000000
 0 10001111 111100010000010000000000 - 0 10001100 010101001110000000000000
 0 10001111 111100010000010000000000 - 0 10001101 001010100111000000000000
 0 10001111 111100010000010000000000 - 0 10001110 000101010011100000000000
 0 10001111 111100010000010000000000 - 0 10001111 000010101001110000000000
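As a rough sketch of this alignment step in C, the code below pulls the fields out of the raw bits and shifts the smaller operand. The `struct parts` type and the `unpack` and `align` names are invented for this illustration, and the sketch assumes both inputs are normal numbers (not zero, subnormal, infinity or NaN).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three fields of a single-precision value. */
struct parts {
    uint32_t sign;      /* 0 or 1 */
    uint32_t exponent;  /* biased, 8 bits */
    uint32_t mantissa;  /* 24 bits, implied leading 1 restored */
};

/* Split the raw bits of a normal float into its fields. */
struct parts unpack(uint32_t bits)
{
    struct parts p;
    p.sign = bits >> 31;
    p.exponent = (bits >> 23) & 0xFFu;
    assert(p.exponent != 0 && p.exponent != 0xFFu);  /* normal numbers only */
    p.mantissa = (bits & 0x7FFFFFu) | 0x800000u;     /* restore the implied 1 */
    return p;
}

/* Shift the operand with the smaller exponent right until both exponents
   match. Bits shifted out are discarded, as in the hand calculation. */
void align(struct parts *a, struct parts *b)
{
    struct parts *lo = (a->exponent < b->exponent) ? a : b;
    struct parts *hi = (lo == a) ? b : a;
    uint32_t shift = hi->exponent - lo->exponent;

    lo->mantissa = (shift < 24) ? (lo->mantissa >> shift) : 0;
    lo->exponent = hi->exponent;
}
```

Unpacking 0x45A9C000 gives a mantissa of 0xA9C000, and aligning it against 0x47F10400 shifts it right by four bits to 0x0A9C00, the last row of the table above.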

Now we can subtract the mantissas. If the sign bits match, we actually subtract; if they do not match, we add. Here they match, so this will be a subtraction.

Computers perform a subtraction with addition logic: they invert the second operand on the way into the adder and assert the carry-in bit, like this:

                          1
   111100010000010000000000
 + 111101010110001111111111
 ==========================

And now, just like with paper and pencil, let's perform the add

  1111000100000111111111111
   111100010000010000000000
 + 111101010110001111111111
 ==========================
   111001100110100000000000

or do it in hex on your calculator

 111100010000010000000000 = 1111 0001 0000 0100 0000 0000 = 0xF10400
 111101010110001111111111 = 1111 0101 0110 0011 1111 1111 = 0xF563FF
 0xF10400 + 0xF563FF + 1 = 0x1E66800
 1111001100110100000000000 = 1 1110 0110 0110 1000 0000 0000 = 0x1E66800

A little something about how the hardware works: since this was really a subtraction done with the adder, we also invert the carry-out bit (on some computers it is left as is). So a carry out of 1 is a good thing, and we basically discard it. Had the carry out been 0, we would have needed more work. After inverting, we have no carry out (no borrow), so our answer really is 0xE66800.
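The invert-and-carry-in trick can be mimicked in C on 24-bit values. This is a minimal sketch; the function name `sub_via_adder` is made up here, and it reports the raw carry out of the adder before any inversion.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract two 24-bit mantissas the way an adder does it:
   a - b == a + ~b + 1, computed within 24 bits.
   *carry_out receives bit 24 of the raw sum; 1 means no borrow (a >= b). */
uint32_t sub_via_adder(uint32_t a, uint32_t b, uint32_t *carry_out)
{
    assert(a < (1u << 24) && b < (1u << 24));

    uint32_t inverted = ~b & 0xFFFFFFu;   /* one's complement, 24 bits wide */
    uint32_t sum = a + inverted + 1u;     /* the +1 is the asserted carry-in */

    *carry_out = (sum >> 24) & 1u;
    return sum & 0xFFFFFFu;               /* keep only the low 24 bits */
}
```

With a = 0xF10400 and b = 0x0A9C00 the raw sum is 0x1E66800, so the function returns 0xE66800 with a carry out of 1.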

Let's quickly look at it another way: instead of inverting and adding one, just subtract on a calculator

 111100010000010000000000 - 000010101001110000000000 = 0xF10400 - 0x0A9C00 = 0xE66800

By trying to visualize it I may have made it worse. The result of subtracting the mantissas is 111001100110100000000000 (0xE66800). The most significant bit did not move; we end up with a 24-bit number whose msbit is 1, so no normalization is needed here. In general, to normalize you shift the mantissa left or right until the most significant 1 sits in the leftmost of the 24 bits, adjusting the exponent for each bit shifted.
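For cases where the result does need normalizing, the shifting could look roughly like the following. The function name `normalize` is hypothetical, and the sketch deliberately ignores exponent overflow and underflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift a nonzero mantissa of up to 25 bits until its
   leading 1 sits in bit 23, adjusting the biased exponent for each shift.
   Overflow and underflow of the exponent are not handled in this sketch. */
uint32_t normalize(uint32_t mantissa, int *exponent)
{
    assert(mantissa != 0 && mantissa < (1u << 25));

    while (mantissa >= (1u << 24)) {   /* too wide: shift right, exponent up */
        mantissa >>= 1;
        (*exponent)++;
    }
    while (mantissa < (1u << 23)) {    /* leading zeros: shift left, exponent down */
        mantissa <<= 1;
        (*exponent)--;
    }
    return mantissa;
}
```

Passing 0xE66800 with an exponent of 0x8F leaves both unchanged, because bit 23 is already set.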

Now, stripping the leading 1. bit off the answer, we put the parts together

 0 10001111 11001100110100000000000
 01000111111001100110100000000000
 0100 0111 1110 0110 0110 1000 0000 0000
 0x47E66800
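Putting the pieces back together is a matter of masking and shifting. A small sketch, with `pack` as an invented name, assuming the mantissa is already normalized to exactly 24 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: assemble sign, biased exponent and a normalized
   24-bit mantissa into single-precision bits, dropping the implied 1. */
uint32_t pack(uint32_t sign, uint32_t exponent, uint32_t mantissa)
{
    assert(sign <= 1u && exponent <= 0xFFu);
    assert((mantissa >> 23) == 1u);   /* leading 1 must sit in bit 23 */

    return (sign << 31) | (exponent << 23) | (mantissa & 0x7FFFFFu);
}
```

Calling `pack(0, 0x8F, 0xE66800)` yields 0x47E66800, the value assembled by hand above.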

If you have been following along by writing a program to do this, so did I. This program leans on a union to reinterpret the bits of a float, which is not guaranteed to be portable. It worked with my compiler on my computer; do not expect it to work everywhere.

 #include <stdio.h>

 /* Lets us view the same 32 bits as a float or as an unsigned integer. */
 union
 {
     float f;
     unsigned int u;
 } myun;

 int main ( void )
 {
     float a,b,c;

     a=123400;
     b=  5432;
     c=a-b;

     /* Print the raw bit pattern and the value of each operand and of the result. */
     myun.f=a; printf("0x%08X %f\n",myun.u,myun.f);
     myun.f=b; printf("0x%08X %f\n",myun.u,myun.f);
     myun.f=c; printf("0x%08X %f\n",myun.u,myun.f);

     return(0);
 }

And our result matches the output of the above program: we got 0x47E66800 doing it by hand

 0x47F10400 123400.000000
 0x45A9C000 5432.000000
 0x47E66800 117968.000000
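If the union makes you nervous, one commonly used alternative is to copy the bytes with `memcpy`. The sketch below is not the program used above; the name `float_bits` is made up, and it assumes that `float` is a 32-bit IEEE 754 single on your platform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bit pattern of a float.
   Assumes float is 32 bits wide and uses the IEEE 754 single format. */
uint32_t float_bits(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);   /* copy the object representation byte for byte */
    return u;
}
```

On such a platform, `float_bits(123400.0f - 5432.0f)` should give the same 0x47E66800 as the hand calculation.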

If you are writing a program to synthesize floating point math, your program can simply perform the subtraction; you do not have to do the invert-and-add-one thing, which overcomplicates it, as we saw above. If you get a negative result, though, you need to flip the sign bit, negate the result, and then normalize.
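One way to avoid ever seeing a negative mantissa is to compare first and swap. This is a small sketch under that assumption; the name `sub_magnitudes` is invented, and the mantissas are expected to be aligned already.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract aligned 24-bit mantissas directly. If b is the larger one, swap
   the operands so the magnitude stays positive and flip the result sign. */
uint32_t sub_magnitudes(uint32_t a, uint32_t b, uint32_t *sign)
{
    assert(a < (1u << 24) && b < (1u << 24));

    if (a >= b)
        return a - b;

    *sign ^= 1u;      /* the result crossed zero */
    return b - a;
}
```

Swapping before subtracting gives the same magnitude as negating a negative result afterwards, so normalization can proceed as usual.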

So:

1) Pull apart the parts: sign, exponent, mantissa.

2) Align your binary points by sacrificing mantissa bits from the number with the smaller exponent: shift that mantissa to the right until the exponents match.

3) This being a subtract operation, if the sign bits match you perform a subtraction; if the sign bits differ you perform an addition of the mantissas.

4) If the result is zero, then your answer is zero; encode the IEEE value for zero as the result. Otherwise:

5) Normalize the number: shift the answer right or left until you have a 24-bit number with the most significant 1 left-justified (24 bits for single precision). The answer from a 24-bit add/subtract can be 25 bits wide, and a subtract can need a dramatic shift to normalize, one or many bits to the left. A more precise way to define normalization is to shift left or right until the number looks like 1.something. If you had 0.001 you would shift left 3; if you had 11.10 you would shift right 1. Each left shift decreases your exponent by one, each right shift increases it by one. This is no different from when we converted from integer to float above.

6) For single precision, remove the leading 1. from the mantissa. If the exponent has overflowed, the result is encoded as infinity rather than as an ordinary number. If the sign bits were different and you performed an addition, the result keeps the sign of the first operand; if they matched and the subtraction went negative, the sign flips, as noted above. Otherwise you just place the sign bit, exponent and mantissa in the result.
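The six steps can be strung together into one routine. Treat the following as a sketch under stated assumptions rather than a complete implementation: the name `soft_sub` is made up, both inputs must be normal numbers, the result must be normal or zero, and bits shifted out during alignment are truncated instead of rounded, so the last bit can differ from what real hardware produces.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a - b on raw single-precision bit patterns. Assumes normal
   inputs and a normal or zero result; truncates instead of rounding. */
uint32_t soft_sub(uint32_t a, uint32_t b)
{
    /* 1) pull apart sign, exponent, mantissa (restoring the implied 1) */
    uint32_t sa = a >> 31, sb = b >> 31;
    int ea = (int)((a >> 23) & 0xFFu), eb = (int)((b >> 23) & 0xFFu);
    uint32_t ma = (a & 0x7FFFFFu) | 0x800000u;
    uint32_t mb = (b & 0x7FFFFFu) | 0x800000u;
    assert(ea != 0 && ea != 0xFF && eb != 0 && eb != 0xFF);

    /* 2) align: shift the mantissa with the smaller exponent to the right */
    int e = (ea > eb) ? ea : eb;
    ma = (e - ea < 24) ? ma >> (e - ea) : 0;
    mb = (e - eb < 24) ? mb >> (e - eb) : 0;

    /* 3) same signs: subtract magnitudes; different signs: add them */
    uint32_t sign = sa, m;
    if (sa != sb) {
        m = ma + mb;
    } else if (ma >= mb) {
        m = ma - mb;
    } else {
        m = mb - ma;
        sign ^= 1u;              /* result crossed zero, so flip the sign */
    }

    /* 4) a zero mantissa means the answer is (positive) zero */
    if (m == 0)
        return 0;

    /* 5) normalize so the leading 1 sits in bit 23 */
    while (m >= (1u << 24)) { m >>= 1; e++; }
    while (m < (1u << 23))  { m <<= 1; e--; }
    assert(e > 0 && e < 0xFF);   /* overflow and underflow are out of scope */

    /* 6) drop the implied 1 and reassemble */
    return (sign << 31) | ((uint32_t)e << 23) | (m & 0x7FFFFFu);
}
```

Feeding it the two operands from the worked example, `soft_sub(0x47F10400, 0x45A9C000)`, returns 0x47E66800, and swapping the arguments returns the same pattern with the sign bit set.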

Multiply and divide are different; you asked about subtract, so that is all I covered.
