A simple solution is to use a good hashing algorithm. You can check the range for the hash(seed || x || y) value hash(seed || x || y) .
Of course, choosing points separately with percentage p does not guarantee that you will get a sample whose size will be exactly p * N (That the expected sample size, but any given sample will be slightly different.) If you want to get a sample size exactly k from the object universe N , you can use the following simple algorithm:
Examine the elements in the sample one at a time until k reaches 0.
When analyzing element i add it to the sample if its hash value mapped to the range [0, Ni) is less than k . If you add an element to the pattern, decrease k .
It is impossible to get absolute absolute arithmetic (since there is no way to perfectly divide 2 i different hash values ββinto N buckets if N not a power of 2), so there will always be a tiny offset. (Floating-point arithmetic does not help; the number of possible floating-point values ββis also fixed and suffers from the same offset.)
If you are doing 64-bit arithmetic, the offset will be really tiny, but arithmetic is more complicated if your environment does not provide 128-bit multiplication. Thus, you can be satisfied with 32-bit computing, where the offset of one of the two thousand million [Note 1] does not matter. Here you can use the fact that any 32 bits in your hash should be as objective as any other 32 bits if your hash algorithm is good (see below). So the following check should work fine:
// I need k elements from a remaining universe of n, and I have a 64-bit hash. // Return true if I should select this element bool select(uint32_t n, uint32_t k, uint64_t hash) { return ((hash & (uint32_t)(-1)) * (uint64_t)n) >> 32 < k; } // Untested example sampler // select exactly k elements from U, using a seed value std::vector<E> sample(const std::vector<E>& U, uint64_t seed, uint32_t k) { std::vector<E> retval; uint32_t n = U.size(); for (uint32_t n = U.size(); k && n;) { E& elt = U[--n]; if (select(n, k, hash_function(seed, elt))) { retval.push_back(elt); --k; } } return retval; }
Assuming you need to do this a lot, you will want to use a quick hash algorithm; since you are not really working in a secure environment, you donβt have to worry about the cryptographically safe algorithm.
Many high-speed hashing algorithms work on 64-bit modules, so you can maximize speed by building a 128-bit input consisting of a 64-bit sample and two 32-bit coordinates. Then you can expand the hash loop to execute exactly two blocks.
I am not going to guess the best hash function for your purpose. You might want to test one or more of these open source hash functions:
... and much more.
Notes
- A couple of billion if you're on the other side of the Atlantic.