Search for a missing number using binary search

I am reading the book Programming Pearls.

Question: Given a sequential file that contains at most four billion 32-bit integers in random order, find a 32-bit integer that is not in the file (and there must be at least one missing). This problem has to be solved with only a few hundred bytes of main memory and several sequential files.

Solution. To set this up as a binary search, we have to define a range, a representation for the elements within the range, and a probing method to determine which half of the range holds the missing integer. How do we do this?

We will use as the range a sequence of integers known to contain at least one missing element, and we will represent the range by a file containing all the integers in it. The insight is that we can probe a range by counting the elements above and below its midpoint: either the upper or the lower half has at most half the elements of the total range. Since the total range has a missing element, the smaller half must also have a missing element. These are most of the ingredients of a binary search algorithm for the above problem.

The text above is copyright Jon Bentley, from the book Programming Pearls.

Some information is provided at the following link.

"Programming Pearls" binary search help

How do we search by passes using binary search? I also could not follow the example given in the link above. Please help me understand the logic with just 5 integers rather than millions of integers.

+6
source share
5 answers

Why don't you re-read the answer in the post "Programming Pearls" binary search help? It explains the process on 5 integers, as you ask.
The idea is that you go through each list and split it into 2 separate lists (this is where the "binary" part comes from) based on the value of the first bit.

I.e., showing the binary representation of the actual numbers:
Original list "": 001, 010, 110, 000, 100, 011, 101 => (broken into)
(we strip the first bit and append it to the "name" of the new list)
To form each of the lists below, we take the values starting with 0 or 1 from the list above:
List "0": 01, 10, 00, 11 (formed from the subset 001, 010, 000, 011 of list "" by stripping the first bit and appending it to the "name" of the new list)
List "1": 10, 00, 01 (formed from the subset 110, 100, 101 of list "" by stripping the first bit and appending it to the "name" of the new list)

Now take each of the resulting lists in turn and repeat the process:
List "0" becomes your original list, and you break it into
List "00" and
List "01" (the last digit of each name is again the first remaining bit of the numbers in the list being broken up)

Continue until you end up with an empty list.

EDIT
The process step by step:
List "": 001, 010, 110, 000, 100, 011, 101 =>
  List "0": 01, 10, 00, 11 (from the subset 001, 010, 000, 011 of list "") =>
    List "00": 1, 0 (from the subset 01, 00 of list "0") =>
      List "000": 0 [end result] (from the subset 0 of list "00")
      List "001": 1 [end result] (from the subset 1 of list "00")
    List "01": 0, 1 (from the subset 10, 11 of list "0") =>
      List "010": 0 [end result] (from the subset 0 of list "01")
      List "011": 1 [end result] (from the subset 1 of list "01")
  List "1": 10, 00, 01 (from the subset 110, 100, 101 of list "") =>
    List "10": 0, 1 (from the subset 00, 01 of list "1") =>
      List "100": 0 [end result] (from the subset 0 of list "10")
      List "101": 1 [end result] (from the subset 1 of list "10")
    List "11": 0 (from the subset 10 of list "1") =>
      List "110": 0 [end result] (from the subset 0 of list "11")
      List "111": missing [end result] (from the EMPTY subset of list "11")
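
The splitting can be sketched in C roughly as follows. This is only an illustration under assumptions: an in-memory array stands in for the lists (scratch files), and the function name find_missing_by_bits is made up for this example. Instead of building every sublist, it keeps only the shorter of the two sublists at each step, which is enough to find one missing number.

```c
#include <stddef.h>

/* Returns a `bits`-bit value that does not occur in list[0..n-1].
 * Assumes n < 2^bits, so at least one value is missing.
 * The array is reordered in place; it stands in for the scratch files. */
unsigned find_missing_by_bits(unsigned *list, size_t n, int bits)
{
    unsigned prefix = 0;                  /* the "name" of the current list */
    for (int b = bits - 1; b >= 0; b--) {
        /* Partition: values whose bit b is 0 are moved to the front. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if (((list[i] >> b) & 1u) == 0) {
                unsigned tmp = list[zeros];
                list[zeros] = list[i];
                list[i] = tmp;
                zeros++;
            }
        }
        size_t ones = n - zeros;
        if (zeros <= ones) {
            n = zeros;                    /* continue with list "prefix0" */
        } else {
            list += zeros;                /* continue with list "prefix1" */
            n = ones;
            prefix |= 1u << b;
        }
    }
    return prefix;                        /* the current list is now empty */
}
```

Called on the seven numbers above with bits = 3, it walks through list "1", then list "11", and ends at the empty list "111".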

The positive thing about this method is that it allows you to find ANY number of missing numbers in the set - i.e. if more than one is missing.

P.S. AFAIR, when exactly 1 number is missing from a complete range, there is an even more elegant solution: XOR all the numbers together.
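
A minimal sketch of that XOR idea in C, assuming the list holds every value of the range lo..hi exactly once except for a single missing value (the function name find_missing_xor and its parameters are made up for this example):

```c
#include <stddef.h>

/* Assumes list[0..n-1] holds every value of lo..hi exactly once,
 * except for one value that is missing. */
unsigned long find_missing_xor(const unsigned long *list, size_t n,
                               unsigned long lo, unsigned long hi)
{
    unsigned long acc = 0;
    for (unsigned long v = lo; ; v++) {   /* XOR of the complete range */
        acc ^= v;
        if (v == hi)
            break;                        /* stop here so v never overflows */
    }
    for (size_t i = 0; i < n; i++)        /* values that are present cancel out */
        acc ^= list[i];
    return acc;                           /* only the missing value remains */
}
```

Every value that is present appears twice in the XOR and cancels itself, so only the missing value is left. If more than one value is missing, the result is meaningless.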

+3
source

The idea is to solve an easier task:

Is the missing value in the range [minVal, X] or in (X, maxVal]? If you know this, you can move X and check again.

For example, you have 3, 4, 1, 5 (2 missing). You know that minVal = 1, maxVal = 5.

  • Range = [1, 5], X = 3. There should be 3 integers in the range [1, 3] and 2 in the range [4, 5]. There are only 2 in the range [1, 3], so you look in the range [1, 3].
  • Range = [1, 3], X = 2. There is only 1 value in the range [1, 2], so you look in the range [1, 2].
  • Range = [1, 2], X = 1. There is 1 value in the range [1, 1] and none in the range [2, 2], so 2 is your answer.

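EDIT: some pseudo-C++ code:

    int minVal = 1, maxVal = 5;   // choose the correct bounds for your data
    while (minVal < maxVal) {
        int X = (minVal + maxVal) / 2;
        int leftNumber = /* how many values fall in [minVal, X] */;
        int rightNumber = /* how many values fall in [X + 1, maxVal] */;
        if (leftNumber < (X - minVal + 1))
            maxVal = X;           // a value is missing from the left half
        else
            minVal = X + 1;       // the left half is complete, look in the right half
    }
    // now minVal == maxVal, and that is the missing value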
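
For reference, here is one way the pseudocode might be turned into compilable C. It is a sketch under assumptions: the values sit in an in-memory array, they are distinct, and the helper names count_in_range and find_missing_in_range are made up for this example.

```c
#include <stddef.h>

/* Counts how many values of data[0..n-1] lie in [lo, hi]. */
static int count_in_range(const int *data, size_t n, int lo, int hi)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= lo && data[i] <= hi)
            count++;
    return count;
}

/* Assumes distinct values taken from [minVal, maxVal], at least one missing. */
int find_missing_in_range(const int *data, size_t n, int minVal, int maxVal)
{
    while (minVal < maxVal) {
        int X = minVal + (maxVal - minVal) / 2;
        int leftNumber = count_in_range(data, n, minVal, X);
        if (leftNumber < X - minVal + 1)
            maxVal = X;         /* a value is missing from [minVal, X] */
        else
            minVal = X + 1;     /* [minVal, X] is complete, look above X */
    }
    return minVal;
}
```

For the array 3, 4, 1, 5 with minVal = 1 and maxVal = 5, the loop visits X = 3, X = 2 and X = 1, exactly as in the three steps listed above, and returns 2.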
+1
source

Here's a simple C solution that should illustrate the technique. To abstract away the tedious file I/O details, I assume the existence of the following three functions:

  • unsigned long next_number (void) reads a number from the file and returns it. When called again, it returns the next number in the file, and so on. The behavior when the end of the file is encountered is undefined.

  • int numbers_left (void) returns true if there are more numbers available to be read with next_number() , false if the end of the file has been reached.

  • void return_to_start (void) rewinds the reading position to the beginning of the file, so the next call to next_number() returns the first number in the file.

I also assume that unsigned long is at least 32 bits wide, as required of conforming ANSI C implementations; modern C programmers may prefer to use uint32_t from stdint.h instead.

Given these assumptions, here is the solution:

    unsigned long count_numbers_in_range (unsigned long min, unsigned long max) {
        unsigned long count = 0;
        return_to_start();
        while ( numbers_left() ) {
            unsigned long num = next_number();
            if ( num >= min && num <= max ) {
                count++;
            }
        }
        return count;
    }

    unsigned long find_missing_number (void) {
        unsigned long min = 0, max = 0xFFFFFFFF;
        while ( min < max ) {
            unsigned long midpoint = min + (max - min) / 2;
            unsigned long count = count_numbers_in_range( min, midpoint );
            if ( count < midpoint - min + 1 ) {
                max = midpoint;      // at least one missing number at or below midpoint
            } else {
                min = midpoint + 1;  // no missing numbers at or below midpoint, must be above
            }
        }
        return min;
    }

Note that min + (max - min) / 2 is the safe way to calculate the average of min and max ; it will not produce bogus results due to overflow of intermediate values, as the seemingly simpler (min + max) / 2 might.

In addition, although it would be tempting to solve this problem using recursion, I chose an iterative solution instead for two reasons: firstly, because it (arguably) shows more clearly what is actually being done, and secondly, because the goal was to minimize memory usage, which presumably includes the stack as well.

Finally, it would be easy to optimize this code, e.g. by returning as soon as count is zero, by counting the numbers in both halves of the range in one pass and choosing the half with more missing numbers, or even by extending the binary search to an n-ary search for some n > 2 to reduce the number of passes. However, to keep the example code as simple as possible, I have left such optimizations out. If you like, you can, say, try changing the code so that it takes no more than eight passes over the file instead of the current 32. (Hint: use a 16-element array.)
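
One possible reading of that hint, sketched in C. This is not the answer author's code: the in-memory array stands in for the file, and the name find_missing_16way is made up for this example. Each pass counts the values in 16 equal sub-ranges of the current range and continues with the emptiest one, so every pass fixes 4 more bits of the answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns a 32-bit value absent from data[0..n-1]; requires n < 2^32.
 * Each iteration of the outer loop is one pass over the data. */
uint32_t find_missing_16way(const uint32_t *data, uint64_t n)
{
    uint32_t prefix = 0;   /* bits of the answer decided so far */
    uint32_t mask = 0;     /* which bit positions are already decided */

    for (int shift = 28; shift >= 0; shift -= 4) {
        uint64_t count[16] = {0};

        /* Count the values of the current range by their next 4 bits. */
        for (uint64_t i = 0; i < n; i++)
            if ((data[i] & mask) == prefix)
                count[(data[i] >> shift) & 0xF]++;

        /* The emptiest bucket holds at most 1/16 of the values in the
         * current range, so it must still have a gap. */
        int best = 0;
        for (int k = 1; k < 16; k++)
            if (count[k] < count[best])
                best = k;

        prefix |= (uint32_t)best << shift;
        mask |= (uint32_t)0xF << shift;
    }
    return prefix;
}
```

The 16 counters take 128 bytes, which stays within the "few hundred bytes" budget, and the outer loop runs exactly eight times.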

+1
source

Actually, suppose we have a range of integers from a to b, i.e. [a..b], and our sequence contains b - a of them. This means that exactly one is missing. And if only one is missing, we can calculate the result using a single loop. First, we calculate the sum of all integers in the range [a..b], which is equal to:

 sum = (a + b) * (b - a + 1) / 2 

Then we calculate the summation of all integers in our sequence:

    long sum1 = 0;
    for (int i = 0; i < b - a; i++)
        sum1 += arr[i];

Then we can find the missing element as the difference of these two sums:

    long result = sum - sum1;

0
source

Once you have seen 2^31 zeros (or ones) in the i-th binary digit, your answer has a one (or a zero, respectively) in the i-th place. (Example: 2^31 ones in the fifth binary position means that the answer has a zero in the fifth binary position.)

First draft C code:

    #include <stdint.h>

    /* How the list is obtained is not important to the question, so it is
     * simply passed in. This only works if the numbers are distinct and
     * exactly one value is missing. */
    uint32_t find_missing(const uint32_t *list4BILLION, uint64_t limit)
    {
        const uint64_t halfLimit = 4294967296ULL / 2;   /* 2^31 */
        uint32_t binaryHistogram[32] = {0};  /* number of 1s seen so far in each bit position */
        int placesChecked[32] = {0};         /* 1 once bit j of the answer is decided */
        uint32_t answer = 0;
        uint64_t i;
        int j, done;

        for (i = 0; i < limit; i++) {
            done = 1;
            for (j = 0; j < 32; j++) {
                if (placesChecked[j])
                    continue;
                binaryHistogram[j] += (list4BILLION[i] >> j) & 1u;
                if (binaryHistogram[j] == halfLimit) {
                    placesChecked[j] = 1;         /* all 2^31 ones seen: answer bit is 0 */
                } else if (i + 1 - binaryHistogram[j] == halfLimit) {
                    answer |= (uint32_t)1 << j;   /* all 2^31 zeros seen: answer bit is 1 */
                    placesChecked[j] = 1;
                } else {
                    done = 0;
                }
            }
            if (done)
                break;   /* no need to pass through the whole list */
        }
        return answer;
    }
0
source

Source: https://habr.com/ru/post/924834/

