How to calculate hypot2(x, y) when numbers can overflow

I would like to do a hypot2 computation on a 16-bit processor.

The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is large inputs. For example, with 200 and 250, multiplying 200 * 200 gives 40,000, which is higher than the maximum signed 16-bit value of 32,767, so it overflows, as does b * b. The overflowed numbers are then added, and the result is useless; it may even signal an error because of a negative sqrt argument.

In my case, I am dealing with 32-bit numbers, but 32-bit multiplication on my processor is very fast, about 4 cycles. I am using a dsPIC microcontroller. I would prefer not to multiply with 64-bit numbers, because that wastes precious memory and will certainly be slower. In addition, I only have sqrt for 32-bit numbers, so 64-bit numbers would require another function. So, how can I calculate a hypotenuse when the values can be large?

Please note that I can only use integer math for this. Using any kind of floating-point math incurs a speed hit that I would rather avoid. My processor has a fast integer/fixed-point atan2, about 130 cycles; can I use this to calculate the length of the hypotenuse?

+6
math
9 answers

Depending on how much precision you need, you can avoid the squares and the square root operation. There is a section on this topic in Understanding Digital Signal Processing by Rick Lyons (section 10.2, "High-Speed Vector Approximation," starting on page 400 in my edition).

The approximation is essentially:

 magnitude = alpha * max + beta * min 

where max and min are the maximum and minimum absolute values of the real and imaginary components, and alpha and beta are two constants chosen to give a reasonable error distribution over the range of interest. These constants can be represented as fractions with power-of-2 divisors to keep the arithmetic simple and efficient. In the book, he suggests alpha = 15/16 and beta = 15/32, and then you can simplify the formula:

 magnitude = (15 / 16) * (max + min / 2) 

which can be implemented as follows using integer operations:

 magnitude = 15 * (max + min / 2) / 16 

and, of course, we can use shifts for divisions:

 magnitude = (15 * (max + (min >> 1))) >> 4 

The error is +/- 5% over a quadrant.
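As a rough sketch in C of the estimator above, assuming signed 32-bit inputs (the function name mag_estimate is illustrative, not from the book): one deliberate change in this sketch is that 15 * s / 16 is written as s - (s >> 4), which rounds up instead of down but avoids the overflow that the product 15 * s could cause for large inputs.

```c
#include <stdint.h>

/* Approximate sqrt(x*x + y*y) as (15/16) * (max + min/2),
 * using only compares, shifts, adds, and subtracts. */
uint32_t mag_estimate(int32_t x, int32_t y)
{
    /* Absolute values in unsigned arithmetic, so INT32_MIN is handled. */
    uint32_t ax = (x < 0) ? (0u - (uint32_t)x) : (uint32_t)x;
    uint32_t ay = (y < 0) ? (0u - (uint32_t)y) : (uint32_t)y;

    uint32_t max = (ax > ay) ? ax : ay;
    uint32_t min = (ax > ay) ? ay : ax;

    /* max + min/2 is at most 1.5 * 2^31, which still fits in 32 bits. */
    uint32_t s = max + (min >> 1);

    /* s - s/16 equals 15*s/16 rounded up, without forming 15*s. */
    return s - (s >> 4);
}
```

For example, mag_estimate(-300, 400) returns 516 against a true magnitude of 500, an error of about 3%.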

Read more about this method here: http://www.dspguru.com/dsp/tricks/magnitude-estimator

+3

This is taken verbatim from this @John D. Cook blog post, hence CW:

Here's how to calculate sqrt(x*x + y*y) without the risk of overflow.

  • max = maximum(|x|, |y|)
  • min = minimum(|x|, |y|)
  • r = min / max
  • return max*sqrt(1 + r*r)
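The steps above are stated for real numbers. One possible integer-only adaptation, sketched here under assumptions that are not in the blog post, holds r in Q15 fixed point and limits the inputs to unsigned 16-bit values so that every intermediate fits in 32 bits. The names hypot_q15 and isqrt32 are hypothetical; isqrt32 stands in for whatever 32-bit integer square root the platform provides.

```c
#include <stdint.h>

/* Bitwise integer square root: returns floor(sqrt(n)).
 * Stand-in for a platform-provided 32-bit sqrt (assumption). */
static uint32_t isqrt32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = (uint32_t)1 << 30;   /* highest power of 4 in 32 bits */

    while (bit > n) {
        bit >>= 2;
    }
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* max * sqrt(1 + r*r), with r = min/max held in Q15 (0..32768).
 * Inputs are already non-negative, so no absolute value is needed. */
uint32_t hypot_q15(uint16_t x, uint16_t y)
{
    uint32_t max = (x > y) ? x : y;
    uint32_t min = (x > y) ? y : x;

    if (max == 0) {
        return 0;                        /* both zero: avoid dividing by 0 */
    }

    uint32_t r = (min << 15) / max;                      /* Q15 ratio */
    uint32_t one_plus_r2 = ((uint32_t)1 << 30) + r * r;  /* Q30, <= 2^31 */
    uint32_t s = isqrt32(one_plus_r2);                   /* Q15 root */

    return (max * s) >> 15;              /* product stays below 2^32 */
}
```

With this sketch, hypot_q15(30000, 40000) gives 50000, even though the plain sum of squares, 2.5e9, would overflow a signed 32-bit integer. The price is one integer division, which may or may not be acceptable on a given target.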

If @John D. Cook comes along and posts this, you should give him the accept :)

+3

Since you essentially cannot do any multiplications without overflow, you are likely to lose some precision.

To get the numbers into the valid range, pull out some factor x and use

 c = x*sqrt( (a/x)*(a/x) + (b/x)*(b/x) ) 

If x is a common factor of a and b, you will not lose accuracy; otherwise the integer divisions truncate and you will lose some.

Update: Even better, given that you can do a little work with 64-bit numbers: with just one 64-bit addition, you can do the rest of this problem in 32 bits with only a slight loss of accuracy. To do this, do two 32-bit multiplications to give you two 64-bit numbers, add them, and then shift the bits right as needed to get the sum back down to 32 bits before taking the square root. If you always shift by multiples of 2 bits, then simply multiply the final result by 2^(half the number of bits shifted), based on the rule above. The truncation should cause only a very small loss of accuracy, no more than 1 part in 2^31, or 0.00000005%.
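A minimal C sketch of this update, assuming signed 32-bit inputs so that the 64-bit sum cannot exceed 2^63. The names hypot_i32 and isqrt32 are hypothetical, and isqrt32 is only a stand-in for the platform's own 32-bit square root.

```c
#include <stdint.h>

/* Bitwise integer square root: returns floor(sqrt(n)).
 * Stand-in for a platform-provided 32-bit sqrt (assumption). */
static uint32_t isqrt32(uint32_t n)
{
    uint32_t root = 0;
    uint32_t bit = (uint32_t)1 << 30;   /* highest power of 4 in 32 bits */

    while (bit > n) {
        bit >>= 2;
    }
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

uint32_t hypot_i32(int32_t a, int32_t b)
{
    /* Magnitudes in unsigned arithmetic, so INT32_MIN is handled. */
    uint32_t ua = (a < 0) ? (0u - (uint32_t)a) : (uint32_t)a;
    uint32_t ub = (b < 0) ? (0u - (uint32_t)b) : (uint32_t)b;

    /* Two 32x32 -> 64-bit products and one 64-bit add (sum <= 2^63). */
    uint64_t sum = (uint64_t)ua * ua + (uint64_t)ub * ub;

    /* Shift right two bits at a time until the sum fits in 32 bits. */
    unsigned shift = 0;
    while (sum > UINT32_MAX) {
        sum >>= 2;
        shift += 2;
    }

    /* sqrt(s * 4^k) = sqrt(s) * 2^k, so undo half the shift afterwards. */
    return isqrt32((uint32_t)sum) << (shift / 2);
}
```

For example, hypot_i32(3000000, 4000000) shifts the sum right by 14 bits and returns 4999936 against a true value of 5000000. Note that in this sketch the integer square root, not the truncated sum, limits the precision: the low shift/2 bits of the result are always zero.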

+2

Aniko and John, it seems to me that you have not addressed the OP's problem. If a and b are integers, then a * a + b * b is likely to overflow, since integer operations are performed. The obvious solution is to convert a and b to floating-point values before calculating a * a + b * b. But the OP has not told us which language to use, so we are a little stuck.

+1

The standard formula is c = sqrt((a * a) + (b * b)). The problem with this is that large inputs overflow.

The solution for overflow (in addition to throwing an error) is to saturate the intermediate computations.

Calculate C = a * a + b * b. If a and b are signed 16-bit numbers, you will never overflow a 32-bit result (hold C as unsigned, since the extreme case a = b = -32768 gives exactly 2^31). If they are unsigned numbers, you need to right-shift the inputs first so that the sum fits in a 32-bit number.

If C > (MAX_RADIUS)^2, return MAX_RADIUS, where MAX_RADIUS is the largest value you can tolerate before treating the result as an overflow.

Otherwise, use either sqrt() or CORDIC, which avoids the cost of square roots in favor of loop iterations, adds, and shifts to extract the magnitude of the vector (a, b).

+1

If you can limit a and b to at most 7 bits, you will not get any overflow. You can use a count-leading-zeros instruction to find out how many bits to discard.

Suppose that a >= b.

 int bits = 16 - count_leading_zeros(a);
 if (bits > 7) {
     a >>= bits - 7;
     b >>= bits - 7;
 }
 c = sqrt(a*a + b*b);
 if (bits > 7) {
     c <<= bits - 7;
 }

Many processors have this instruction nowadays, and if yours does not, you can use other fast methods.

Although this will not give you an exact answer, it will be very close (at most about 1% off).

+1

Do you need full precision? If not, you can extend your range slightly by dropping a few of the least significant bits before multiplying.

Can a and b be anything? How about a lookup table if you only have a few a and b that you need to calculate?

0

A simple solution to avoid overflow is to divide a and b by a+b before squaring, and then multiply the square root by a+b. Or do the same with max(a,b).

0

You can do a little simple algebra to get the results back in range.

 sqrt((a * a) + (b * b))
   = 2 * sqrt(((a * a) + (b * b)) / 4)
   = 2 * sqrt((a * a) / 4 + (b * b) / 4)
   = 2 * sqrt((a/2 * a/2) + (b/2 * b/2))
0