Why this is the smallest amount of memory:
The proposed solution has two phases:
Phase 1: Count how many of the input integers fall into each block. This phase uses 4*(#blocks) bytes of memory.
Phase 2: Use a bit vector in which each bit represents one integer of the block being searched. This phase uses blocksize/8 = (N/blocks)/8 bytes of memory.
Setting these two equal, 4*blocks = (N/blocks)/8, gives blocks = sqrt(N/32), as you already mentioned.
This is optimal because the total memory requirement is the maximum of the two phases' requirements, and both phases must be executed. After phase 1 you can discard the counters, keeping only the index of the block to search in phase 2.
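Here is a minimal Python sketch of the two phases, assuming the input can be streamed twice (for example by reopening the file). The function name, the read_input callable, and the default of N = 2^32 are illustrative assumptions, not part of the original question.

```python
import array
import math

def find_missing_two_phase(read_input, n_values=2**32):
    # read_input() must return a fresh iterator over the input integers
    # each time it is called (the input is scanned once per phase).
    n_blocks = math.isqrt(n_values // 32)        # blocks = sqrt(N/32)
    block_size = -(-n_values // n_blocks)        # ceiling division

    # Phase 1: one 4-byte counter per block (4 * n_blocks bytes).
    counts = array.array('I', [0] * n_blocks)
    for x in read_input():
        counts[x // block_size] += 1

    # A block holding fewer values than it can contain is missing at least
    # one integer (this assumes the input integers are distinct).
    def capacity(b):
        return min(block_size, n_values - b * block_size)

    target = next(b for b in range(n_blocks) if counts[b] < capacity(b))
    lo, hi = target * block_size, target * block_size + capacity(target)

    # Phase 2: one bit per integer of the chosen block ((N/blocks)/8 bytes).
    seen = bytearray((hi - lo + 7) // 8)
    for x in read_input():
        if lo <= x < hi:
            seen[(x - lo) >> 3] |= 1 << ((x - lo) & 7)

    for v in range(lo, hi):
        if not seen[(v - lo) >> 3] & (1 << ((v - lo) & 7)):
            return v
    return None
```

With N = 2^32 this gives about 11585 blocks, so each phase stays around 46 KB of memory.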
Optimization
If your counters saturate when they reach their capacity, you don't really need 4 bytes per counter but only 3, because a counter never needs to count past the number of integers in its block.
In this case, phase 1 uses 3*blocks bytes of memory and phase 2 uses (N/blocks)/8 bytes. Setting these equal gives the optimum blocks = sqrt(N/24). If N is 4 billion, the number of blocks is approximately 12910 and the block size is about 309838 integers per block, a count that fits in 3 bytes.
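Packed, saturating 3-byte counters are easy to emulate with a bytearray; the following is a rough sketch (the helper names are mine, not part of the original answer).

```python
# Saturation value: well above any realistic block size (~310,000 here),
# so in practice a counter never saturates unless the input has duplicates.
CAP = (1 << 24) - 1

def make_counters(n_blocks):
    return bytearray(3 * n_blocks)          # 3 bytes per counter, all zero

def increment(counters, block):
    i = 3 * block
    value = int.from_bytes(counters[i:i + 3], 'little')
    if value < CAP:                         # saturate instead of wrapping around
        counters[i:i + 3] = (value + 1).to_bytes(3, 'little')

def read(counters, block):
    i = 3 * block
    return int.from_bytes(counters[i:i + 3], 'little')
```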
Caveats, and an alternative with good average-case performance
The algorithm you propose only works if all input integers are distinct. If they are not, I suggest simply going with a randomized candidate-set approach: select, say, 1000 candidate integers at random and check whether any of them is absent from the input file. If every candidate appears in the file, try another random candidate set. Although this has poor worst-case performance, it will be faster in most cases; for example, if the input covers 99% of the possible integers, then on average 10 of the 1000 candidates will not be found. You can choose the candidates pseudo-randomly so that you never repeat a candidate, and so that after a bounded number of attempts you are guaranteed to have checked every possible integer.
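A minimal sketch of the randomized candidate-set idea might look like this, again assuming a read_input callable that yields the input values on each call. The candidate count of 1000 and the attempt limit are illustrative, and this version does not implement the never-repeat pseudo-random selection mentioned above.

```python
import random

def find_missing_randomized(read_input, n_candidates=1000, max_attempts=100):
    for _ in range(max_attempts):
        # Pick a fresh set of random 32-bit candidates.
        candidates = set(random.sample(range(2**32), n_candidates))
        for x in read_input():           # one pass over the input per attempt
            candidates.discard(x)
            if not candidates:           # every candidate was present; stop early
                break
        if candidates:                   # anything left was never seen in the input
            return candidates.pop()
    return None                          # caller should fall back to the deterministic method
```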
If you check sqrt(N) candidates at a time, the worst-case performance can be as bad as N*sqrt(N), because you may have to scan all N input integers sqrt(N) times.
You can avoid this worst case by using the alternative only for the first candidate set, and switching to the proposed solution if it fails. This gives better average-case performance (it is the same strategy as a sorting algorithm that starts with quicksort and switches to heapsort if the input turns out to be a bad case).
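The fallback could be wired up roughly like this, treating the randomized and deterministic searches as interchangeable callables (for example the two sketches above); the names and the single-attempt limit are placeholders of mine.

```python
def find_missing_hybrid(read_input, randomized, two_phase):
    # One cheap randomized attempt first (assumes the signature of the
    # earlier randomized sketch); usually this already succeeds.
    result = randomized(read_input, max_attempts=1)
    if result is not None:
        return result
    # Otherwise fall back to the guaranteed two-phase search,
    # keeping the worst case bounded by two extra passes over the input.
    return two_phase(read_input)
```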