Deterministic scrambling bit for filtering coordinates

I am trying to write a function which, given an (x, y) coordinate pair and a random program seed, will pseudo-randomly return true for a given percentage of all such pairs. There are no restrictions on x or y beyond the limits of the data type, which is a 32-bit signed int.

My current approach is to mash the x, y and seed bits together, and then compare the resulting number against the percentage:

 float percentage = 0.005;
 ...
 unsigned int n = (x ^ y) ^ seed;
 return (((float) n / UINT_MAX) < percentage);

However, this approach seems to be biased for specific values of x and y. For example, if it returns true for (0, a), it will also return true for (a, 0).

I know that this implementation, which simply XORs them together, is naive. Is there a better scrambling algorithm to be used here that will not be biased?

Edit: To clarify, I am not starting with a set of (x, y) coordinates, and I am not trying to produce a fixed-size set of coordinates that evaluate to true. The function must be able to evaluate the truth value for arbitrary x, y and seed, with the "true" coordinates occurring at the given percentage on average.

c algorithm random

A simple solution is to use a good hashing algorithm. You can then do a range check on the value of hash(seed || x || y).
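As a concrete sketch of that range check (the mixer chosen here is splitmix64's finalizer, but any good 64-bit hash works; the names `mix64` and `selected` are illustrative, not from the answer):

```cpp
#include <cstdint>

// splitmix64's finalizer; any good 64-bit mixer would do.
static uint64_t mix64(uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Range check on hash(seed || x || y):
// true for roughly `percentage` of all (x, y) pairs.
bool selected(uint64_t seed, int32_t x, int32_t y, double percentage) {
    // Pack both coordinates into one 64-bit word, preserving bit patterns.
    uint64_t coords = ((uint64_t)(uint32_t)x << 32) | (uint32_t)y;
    uint64_t h = mix64(seed ^ mix64(coords));
    // Map the top 32 bits to [0, 1) and compare against the threshold.
    return (double)(h >> 32) / 4294967296.0 < percentage;
}
```

Because x and y occupy disjoint bit ranges before mixing, (0, a) and (a, 0) hash differently, unlike the plain XOR.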

Of course, selecting points independently with probability p does not guarantee that you will get a sample whose size is exactly p * N. (That is the expected sample size, but any given sample will deviate slightly.) If you want a sample of exactly k elements from a universe of N objects, you can use the following simple algorithm:

  • Examine the elements one at a time until k reaches 0.

  • When examining element i, add it to the sample if its hash value mapped onto the range [0, N−i) is less than k. If you add the element to the sample, decrement k.

It is impossible to achieve perfect fairness (since there is no way to divide 2^i different hash values evenly into N buckets unless N is a power of 2), so there will always be a tiny bias. (Floating-point arithmetic does not help; the number of possible floating-point values is also fixed, and it suffers from the same bias.)
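To quantify that bias, here is a small illustration (my own, not from the answer): mapping the 2^32 possible 32-bit hash values onto N buckets via the fixed-point trick (h * N) >> 32 produces bucket sizes that differ by at most one, and that difference of one is the irreducible bias.

```cpp
#include <cstdint>

// Bucket b receives exactly the hashes h with (h * N) >> 32 == b,
// i.e. h in [ceil(b * 2^32 / N), ceil((b + 1) * 2^32 / N)).
// Count that range's size without enumerating all 2^32 hashes.
uint64_t bucket_size(uint64_t N, uint64_t b) {
    uint64_t two32 = 1ull << 32;
    uint64_t lo = (b * two32 + N - 1) / N;       // ceil division
    uint64_t hi = ((b + 1) * two32 + N - 1) / N;
    return hi - lo;
}
```

For N = 3, the buckets hold 1431655766, 1431655765 and 1431655765 hashes: off by one part in about 1.4 thousand million.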

If you do the arithmetic in 64 bits, the bias will be truly tiny, but the arithmetic is more complicated if your environment does not provide 128-bit multiplication. So you may be satisfied with 32-bit computation, where a bias of one part in a couple of thousand million [Note 1] does not matter. Here you can rely on the fact that any 32 bits of your hash should be as unbiased as any other 32 bits, provided your hash algorithm is good (see below). So the following check should work fine:

 // I need k elements from a remaining universe of n, and I have a 64-bit hash.
 // Return true if I should select this element.
 bool select(uint32_t n, uint32_t k, uint64_t hash) {
   return ((hash & (uint32_t)(-1)) * (uint64_t)n) >> 32 < k;
 }

 // Untested example sampler:
 // select exactly k elements from U, using a seed value.
 std::vector<E> sample(const std::vector<E>& U, uint64_t seed, uint32_t k) {
   std::vector<E> retval;
   for (uint32_t n = U.size(); k && n;) {
     const E& elt = U[--n];
     if (select(n, k, hash_function(seed, elt))) {
       retval.push_back(elt);
       --k;
     }
   }
   return retval;
 }

Assuming you need to do this a lot, you will want a fast hash algorithm; since you are not working in an adversarial environment, you do not need a cryptographically secure one.

Many high-speed hash algorithms work on 64-bit blocks, so you can maximize speed by building a 128-bit input consisting of the 64-bit seed and the two 32-bit coordinates. You can then unroll the hashing loop to process exactly two blocks.
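One possible layout for that 128-bit input (the packing below is an assumption for illustration; the answer does not prescribe a specific layout):

```cpp
#include <cstdint>

// Two 64-bit words: the seed, then x and y side by side.
// A 64-bit block hash can consume this in exactly two rounds.
struct HashInput {
    uint64_t block[2];
};

HashInput pack(uint64_t seed, int32_t x, int32_t y) {
    HashInput in;
    in.block[0] = seed;
    // Casting through uint32_t preserves the coordinates' bit patterns,
    // including negative values.
    in.block[1] = ((uint64_t)(uint32_t)x << 32) | (uint32_t)y;
    return in;
}
```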

I am not going to guess the best hash function for your purpose. You might want to try one or more of these open-source hash functions:

... and many more.


Notes

  • A couple of billion if you're on the other side of the Atlantic.

I would rather feed the seed, x and y through a combined linear congruential generator.

This is generally much faster than hashing, and it is designed specifically for this purpose: to output uniformly distributed pseudo-random numbers over a given range.

Using the coefficients recommended by Wichmann-Hill (which are also used in some versions of Microsoft Excel), we can do:

 si = 171 * s % 30269;
 xi = 172 * x % 30307;
 yi = 170 * y % 30323;
 r_combined = fmod(si/30269. + xi/30307. + yi/30323., 1.);
 return r_combined < percentage;

where s is the seed on the first call, and the previous si on each subsequent call. (Thanks to rici's comment for this point.)
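A self-contained sketch of this snippet (the types, the name `select_wh`, and the modular reduction of the raw inputs before multiplying are my additions; reducing first keeps the products within 32 bits for arbitrarily large coordinates):

```cpp
#include <cmath>
#include <cstdint>

// Wichmann-Hill-style combined LCG step, used statelessly:
// seed and both coordinates each drive one congruential component.
bool select_wh(uint32_t s, uint32_t x, uint32_t y, double percentage) {
    uint32_t si = 171u * (s % 30269u) % 30269u;
    uint32_t xi = 172u * (x % 30307u) % 30307u;
    uint32_t yi = 170u * (y % 30323u) % 30323u;
    // Combine the three components into a single value in [0, 1).
    double r = fmod(si / 30269. + xi / 30307. + yi / 30323., 1.);
    return r < percentage;
}
```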

