Shuffling a huge range of numbers with minimal storage

I have a very large range of numbers, (1..1236401668096), which I would basically like to "shuffle", i.e. visit in random order without ever revisiting the same number. I will run a web service, and each time a request arrives it will increment a counter and pull the next "shuffled" number out of the range. The algorithm must cope with the server going offline: it should be able to resume from a saved counter value (much like seeding a pseudo-random number generator and replaying it to a given iteration reproduces the same position in the sequence).

I am wondering whether such an algorithm exists or could exist. I looked at the Fisher-Yates shuffle, but its first step is to "write down the numbers from 1 to N", which would require terabytes of storage for my entire range. Generating a pseudo-random number for each request might work for a while, but as the database/tree fills up, collisions will become more common and may degrade performance (by my calculation there is already a ~0.08% chance that a given new draw collides after 1 billion draws). Is there a better solution for my scenario, or is this just a pipe dream?
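For reference, one way to reproduce the ~0.08% figure (a back-of-the-envelope check, not necessarily the questioner's actual calculation) is the chance that a single new uniform draw hits one of the numbers already seen:

```ruby
# Probability that one more uniform draw from 1..n collides with any of
# the k values already drawn: the new draw must hit one of those k slots.
k = 1_000_000_000
n = 1_236_401_668_096
p = k.fdiv(n)
puts format('%.4f%%', p * 100)   # ~0.0809%
```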

The reason for the shuffling is that being able to guess the next number in the sequence could expose a minor DoS vulnerability in my application, but also because the presentation looks much better with a wider distribution of numbers (I'd rather not go into details about what the application does). At the moment I'm considering using a PRNG and handling collisions, or shuffling sub-ranges (starting with (1..10000000).to_a.shuffle, then (10000001..20000000).to_a.shuffle, and so on, as each sub-range is exhausted).

Do any mathematicians have any better ideas / suggestions?

3 answers

Combine a PRNG or LFSR sequence with bits from /dev/random

There are several algorithms that can generate pseudorandom numbers with arbitrarily large and known periods. The two obvious candidates are a linear congruential generator (LCG) and a linear feedback shift register (LFSR), but there are others, such as the Mersenne Twister.

The period of these generators can easily be chosen to match your requirements, and then you simply will not get collisions.

You can handle the predictable behavior of PRNGs and LFSRs by appending 10, 20, or 30 bits of cryptographically hashed entropy from an interface such as /dev/random. Since the deterministic part of the number is known to be unique, it does not matter if the truly random part of it ever repeats.
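As a sketch of the LCG variant (constants are my own, illustrative ones, not vetted for statistical quality): for a power-of-two modulus m = 2^k, the Hull-Dobell theorem gives a full period whenever the increment is odd and the multiplier is congruent to 1 mod 4. Taking the smallest power of two that covers the range and "cycle-walking" past out-of-range states yields a collision-free sequence whose entire state is one integer you can persist across restarts, with unpredictable bits appended as the answer suggests:

```ruby
require 'securerandom'

RANGE_MAX = 1_236_401_668_096
M = 1 << 41               # smallest power of two >= RANGE_MAX
A = 0x5DECE66D * 4 + 1    # multiplier: a % 4 == 1 (Hull-Dobell)
C = 12345                 # increment: odd (Hull-Dobell)

# Advance the LCG, skipping states outside 1..RANGE_MAX ("cycle
# walking"). Each in-range value still occurs exactly once per period,
# so there are no collisions until the whole range is exhausted.
def next_number(state)
  loop do
    state = (A * state + C) % M
    return state if state.between?(1, RANGE_MAX)
  end
end

state = 0                       # persist this single integer to resume
first = next_number(state)

# Append 20 unpredictable bits so the next ID cannot be guessed; the
# deterministic high part keeps the whole ID unique regardless.
tagged = (first << 20) | SecureRandom.random_number(1 << 20)
```

Since M is only about 1.78x the range size, the cycle-walking loop runs fewer than two iterations on average.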


Divide and conquer? Break the range into manageable pieces and shuffle those. You can split the numbers, for example, by their value modulo n. Each group is cheap to construct and fairly small, depending on n. When one group is exhausted, you move on to the next.

For example, if you choose n = 1000, you get 1000 different groups. Pick a random number from 1 to 1000 (call it x) and shuffle the numbers whose value modulo 1000 is x. Once you have exhausted that group, pick a new random number from 1 to 1000 (excluding x, obviously) to get the next subset to shuffle. It shouldn't be hard to keep track of which numbers in the 1..1000 range have already been used, so you only need a repeatable shuffle algorithm for the numbers within each subset (for example, Fisher-Yates over the group's indexes).
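A minimal sketch of this scheme, using a scaled-down MAX and Ruby's seedable Random so each group's shuffle is reproducible after a restart (the names and the seeding approach are mine, not the answerer's):

```ruby
MAX = 10_000_000   # small stand-in for 1_236_401_668_096
N   = 1_000        # number of residue groups

# All numbers in 1..MAX whose value modulo N equals `residue`.
def group(residue)
  first = residue.zero? ? N : residue
  (first..MAX).step(N).to_a
end

# A fixed seed makes the shuffle repeatable, so only the seed, the order
# of residues, and a position counter need to be persisted.
def shuffled_group(residue, seed)
  group(residue).shuffle(random: Random.new(seed))
end

residues    = (0...N).to_a.shuffle(random: Random.new(42)) # group visit order
first_group = shuffled_group(residues.first, 42)
```

For the real 1.2-trillion range each group would still hold over a billion numbers at n = 1000, so you would pick a much larger n (or recurse) to keep each `to_a` affordable.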


I think the best option is to use a GUID/UUID. They are made for exactly this kind of thing, and it's easy to find an existing implementation to suit your needs.

While collisions are theoretically possible, they are extremely unlikely. To quote Wikipedia:

The probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.
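In Ruby (assuming the question's language from its code snippets), the standard library already provides random version-4 UUIDs, with no state to persist at all:

```ruby
require 'securerandom'

# A random (version 4) UUID carries 122 random bits, so collisions are
# astronomically unlikely even across billions of identifiers.
id = SecureRandom.uuid
puts id   # e.g. "f81d4fae-7dec-4d0a-a765-00a0c91e6bf6" (random each call)
```

Note this trades the question's strict no-repeat guarantee for a probabilistic one, which is usually acceptable at these odds.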

