Implement floating point subtraction

I am trying to implement a floating point arithmetic library, and I am having trouble understanding the algorithm for subtracting floats. I have implemented addition successfully, and I thought subtraction was just a special case of it, but it seems I am making a mistake somewhere. I am adding the code here for reference only; it calls several self-explanatory helper functions, and I do not expect anyone to understand it 100%. What I would like help with is the algorithm. Do we follow the same method as when adding floating point numbers, except that when we add the mantissas we first convert the negative one (the one being subtracted) to two's complement and then add them?

That is what I am doing, but the result is incorrect. It is very close, but not the same. Does anyone have any ideas? Thanks in advance!

I am fairly sure the general approach works, since I have implemented an almost identical algorithm for adding floats and it works like a charm.

    _float subFloat(_float f1,_float f2)
    {
        unsigned char diff;
        _float result;

        //first see whose exponent is greater
        if(f1.float_parts.exponent > f2.float_parts.exponent)
        {
            diff = f1.float_parts.exponent - f2.float_parts.exponent;

            //now shift f2 mantissa by the difference of their exponent to the right
            //adding the hidden bit
            f2.float_parts.mantissa = ((f2.float_parts.mantissa)>>1) | (0x01<<22);
            f2.float_parts.mantissa >>= (int)(diff);//was (diff-1)

            //also increase its exponent by the difference shifted
            f2.float_parts.exponent = f2.float_parts.exponent + diff;
        }
        else if(f1.float_parts.exponent < f2.float_parts.exponent)
        {
            diff = f2.float_parts.exponent - f1.float_parts.exponent;
            result = f1;
            f1 = f2;        //swap them
            f2 = result;

            //now shift f2 mantissa by the difference of their exponent to the right
            //adding the hidden bit
            f2.float_parts.mantissa = ((f2.float_parts.mantissa)>>1) | (0x01<<22);
            f2.float_parts.mantissa >>= (int)(diff);

            //also increase its exponent by the difference shifted
            f2.float_parts.exponent = f2.float_parts.exponent + diff;
        }
        else//if the exponents were equal
            f2.float_parts.mantissa = ((f2.float_parts.mantissa)>>1) | (0x01<<22); //bring out the hidden bit

        //getting two complement of f2 mantissa
        f2.float_parts.mantissa ^= 0x7FFFFF;
        f2.float_parts.mantissa += 0x01;

        result.float_parts.exponent = f1.float_parts.exponent;
        result.float_parts.mantissa = (f1.float_parts.mantissa +f2.float_parts.mantissa)>>1;
        //gotta shift right by overflow bits

        //normalization
        if(manBitSet(result,1))
            result.float_parts.mantissa <<= 1;  //hide the hidden bit
        else
            result.float_parts.exponent +=1;

        return result;
    }
+4
2 answers

If your addition code is correct and your subtraction code is not, the problem is presumably in the two's complement and the addition that follows it.

Is it necessary to take the two's complement and add, rather than simply subtract?
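On the two's-complement question: in fixed-width unsigned arithmetic, adding the two's complement of `b` produces the same bits as subtracting `b` directly, provided the complement and the result are masked to the same width. A minimal sketch, assuming a 24-bit significand (23 stored bits plus the hidden bit) held in a wider integer; `SIG_MASK`, `sub_direct` and `sub_twos_complement` are illustrative names, not part of the question's code:

```c
#include <stdint.h>

/* Assumed width: 24-bit significand (23 stored bits + hidden bit). */
#define SIG_MASK 0xFFFFFFu

/* Plain subtraction, wrapped to 24 bits. */
uint32_t sub_direct(uint32_t a, uint32_t b)
{
    return (a - b) & SIG_MASK;
}

/* Same result via two's complement: invert all 24 bits, add 1, then add. */
uint32_t sub_twos_complement(uint32_t a, uint32_t b)
{
    uint32_t neg_b = ((b ^ SIG_MASK) + 1u) & SIG_MASK;
    return (a + neg_b) & SIG_MASK;
}
```

The two functions agree only because the mask covers every bit of the operands, hidden bit included, so the mask width is worth checking against the operand width.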

If that is not the problem, I am having trouble following your algorithm. It has been a while since I did anything like this. Could you provide some details? More specifically, what is the hidden bit?

It seems to me that the handling of the hidden bit is right for addition but not for subtraction. Should you perhaps set it in the f1 mantissa rather than the f2 mantissa? Or negate the f1 mantissa instead of the f2 mantissa?
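For reference, the hidden bit is the implicit leading 1 of a normalized IEEE 754 significand: single precision stores only the 23 fraction bits, and the leading 1 has to be restored before doing arithmetic on the mantissas. A sketch, assuming `float` is IEEE 754 binary32; the helper names are hypothetical and this is not the `_float` type from the question:

```c
#include <stdint.h>
#include <string.h>

/* Return the 23 stored fraction bits of a binary32 value. */
uint32_t stored_fraction(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to read the bit pattern */
    return bits & 0x7FFFFFu;
}

/* Return the biased 8-bit exponent field. */
uint32_t biased_exponent(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits >> 23) & 0xFFu;
}

/* Full 24-bit significand: fraction with the hidden bit restored at bit 23.
   Only meaningful for normal numbers (exponent field neither 0 nor 255). */
uint32_t full_significand(float f)
{
    return stored_fraction(f) | (1u << 23);
}
```

Holding the significand in a wider integer like this means both operands can carry their hidden bit without shifting out the lowest fraction bit to make room for it.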

Without knowing what you are getting versus what you expect, and without more detail about the algorithm you are using, that is the best I can do.

Edit: OK, I looked at the links in your comment. One thing you are not doing in the supplied code is normalization. When adding, either the hidden bit overflows (shift the mantissa right, increment the exponent) or it doesn't. When subtracting, an arbitrary number of leading mantissa digits can become zero. In decimal, consider adding 0.5E1 and 0.50001E1: you get 1.00001E1, which normalizes to 0.100001E2. Subtracting 0.5E1 from 0.50001E1 gives 0.00001E1. You then need to shift the mantissa left and decrement the exponent as many times as it takes, here four, to get 0.1E-3.
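The binary version of that renormalization step might look like the sketch below. It assumes the significand is kept in a 32-bit integer with the hidden bit at bit 23 and the exponent in a plain `int`; the `unpacked` struct and `normalize` function are hypothetical, and rounding, subnormals and exponent overflow or underflow are deliberately ignored:

```c
#include <stdint.h>

/* Hypothetical unpacked form: 24-bit significand (hidden bit at bit 23)
   plus an exponent. Not the _float type from the question. */
typedef struct {
    uint32_t sig;
    int      exp;
} unpacked;

/* Renormalize after an add or subtract.
   Ignores rounding, subnormals and exponent overflow/underflow. */
unpacked normalize(unpacked x)
{
    if (x.sig == 0) {                    /* exact cancellation: nothing to shift */
        x.exp = 0;
        return x;
    }
    while (x.sig >= (1u << 24)) {        /* carry out of bit 23 (after addition) */
        x.sig >>= 1;
        x.exp += 1;
    }
    while ((x.sig & (1u << 23)) == 0) {  /* leading zeros (after subtraction) */
        x.sig <<= 1;
        x.exp -= 1;
    }
    return x;
}
```

The loop for leading zeros is the part a single `if` cannot cover: a subtraction of nearly equal values can require anywhere from 1 to 23 left shifts.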

+1

a-b == a+(-b), and unary minus is trivial, so I wouldn't even bother with a separate binary minus.
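A minimal sketch of that reduction, assuming IEEE 754 binary32 where unary minus is a flip of the sign bit. The function names are made up, and `add_float` uses the native `+` purely as a stand-in for the library's own addition routine so that the example is self-contained:

```c
#include <stdint.h>
#include <string.h>

/* Unary minus on an IEEE 754 binary32 value: flip the sign bit (bit 31). */
float negate(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits ^= 0x80000000u;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Stand-in for the library's own addition routine (assumption:
   native + is used here only to keep the sketch self-contained). */
float add_float(float a, float b)
{
    return a + b;
}

/* Subtraction reduced to addition of the negated operand. */
float sub_float(float a, float b)
{
    return add_float(a, negate(b));
}
```

This only works if the addition routine already handles operands of opposite sign, so the mantissa subtraction and renormalization still have to live somewhere, just inside the add path.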

+2