Find the longest bitmap prefix

I am trying to find a fast algorithm that finds the longest common prefix of several bit arrays. In my application these bitmaps can be arbitrarily long and of variable length. For example, if I have these bitmaps:

    0b1011001 0b1001101 0b1001010 0b1010100

The longest common prefix is 10. I am currently ORing and NANDing the bitmaps to find their common 0s and 1s, and XORing the results together:

    OR   0b1011111
    NAND 0b0111111
    XOR  0b0011111
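
For concreteness, here is roughly that combination written out in Python (the variable names are mine; XORing the OR of all bitmaps with the AND of all bitmaps produces the same 0b0011111 mask shown above, whose leading zeros mark the common prefix):

    bitmaps = [0b1011001, 0b1001101, 0b1001010, 0b1010100]

    ored, anded = 0, ~0
    for b in bitmaps:
        ored |= b     # 0 only where every bitmap has a 0
        anded &= b    # 1 only where every bitmap has a 1

    mask = ored ^ anded   # 0b0011111: bits where the bitmaps disagree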

Is there any faster solution?

4 answers

You do not need to OR or NAND all the arrays (that would be quite expensive, since they are arbitrarily long). You can simply scan the arrays from left to right until you find the first mismatch. This is O(kn), where n is the number of arrays and k is the length of the common prefix.

My Python is terrible, so I will only give a very simple example with two arrays of fixed, equal length to make the idea clear:

    a = [1,0,1,1,0,0,1]
    b = [1,0,1,1,0,1,1]
    for x in range(7):
        if a[x] != b[x]:
            print(a[0:x])
            break

    output: [1, 0, 1, 1, 0]

Obviously you will need to generalize this; I think that will be easy once you understand the logic of the code:

  • Iterate x over the xth bit of all the arrays until you find a mismatch (i.e. the arrays do not all have the same bit value) or until you reach the end of the shortest array
  • Print the first x bits of any of the arrays (a generalized sketch follows this list)
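
A minimal sketch of those steps generalized to any number of variable-length arrays (the function name is mine):

    def longest_common_prefix(arrays):
        shortest = min(len(a) for a in arrays)
        for x in range(shortest):
            # stop at the first column where the arrays disagree
            if any(a[x] != arrays[0][x] for a in arrays):
                return arrays[0][:x]
        return arrays[0][:shortest]

    # e.g. longest_common_prefix([[1,0,1,1,0,0,1], [1,0,0,1,1,0,1],
    #                             [1,0,0,1,0,1,0], [1,0,1,0,1,0,0]]) -> [1, 0]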

On your approach

It scales well (linearly) in the number of bit arrays.

It does not scale very well with the size of the bit arrays; ideally, it should scale with the length of the common prefix rather than with the size of the bit arrays.

Low level

Bit operations on whole bytes/words of a bitmap should be much faster than going through the bits one at a time. (Not sure how much low-level control Python gives you, though.)

First suggestion

If this were a low-level language such as C, I would solve it much the same way you do, but with a few ideas from the other answers.

In my example, I will assume that the computer is a 64-bit machine.

Start by applying your (OR, NAND, XOR) step to only the first 64 bits of each bit array (these are single native operations on a 64-bit machine, so the cost is only O(number of arrays)).

Then quickly find the position of the first set bit in the result (most machines have a built-in instruction or well-optimized assembly for this; in C, for example, GCC's __builtin_clz / __builtin_ctz family, depending on which end of the word holds the first bit). If there is a set bit, that position is your answer.

Otherwise, repeat with bits 64-127 of each bitmap, and so on.

(You will need to deal with bitmaps of different lengths, perhaps by finding the minimum bitmap length first and using that as the limit.)

The advantage of this approach is that it is linear both in the number of bit arrays and in the length of the common prefix.
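
The real payoff needs a low-level language where each 64-bit chunk is one machine word, but a sketch of the same word-at-a-time structure in Python might look like this (the function name and the assumption that all bitmaps are plain ints of the same known length nbits are mine):

    def common_prefix_length(bitmaps, nbits, word=64):
        prefix_len = 0
        for hi in range(nbits, 0, -word):            # walk from the most significant end
            lo = max(hi - word, 0)
            width = hi - lo
            chunk_mask = (1 << width) - 1
            chunks = [(b >> lo) & chunk_mask for b in bitmaps]
            ored, anded = 0, chunk_mask
            for c in chunks:
                ored |= c                            # common 0s stay 0 here
                anded &= c                           # common 1s stay 1 here
            diff = ored ^ anded                      # bits where the bitmaps disagree
            if diff == 0:
                prefix_len += width                  # whole chunk matches, keep going
                continue
            prefix_len += width - diff.bit_length()  # leading matching bits of this chunk
            break
        return prefix_len

    # common_prefix_length([0b1011001, 0b1001101, 0b1001010, 0b1010100], nbits=7) -> 2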

Second suggestion

If the number of bit arrays is large, you can get a further speedup using parallelism.

First, you can run the OR at the same time as the NAND.

With more ingenuity, you can apply the following:

If I have 4 bitmaps A, B, C, D, then instead of (((A OR B) OR C) OR D) I could compute (A OR B) OR (C OR D).

In both cases, the same number of ORs is executed, but the second form can be parallelized effectively (and, in fact, it can be faster even on a single-core machine, since the processor will often have more than one ALU).

Writing the logic is a bit more complicated, since you cannot use a single for loop with one temporary variable accumulating the OR result.

You can imagine storing the sub-results in an array half the length of the number of bit arrays: store A OR B in array[0] and C OR D in array[1], then compute array[0] OR array[1]. (You can put each round's results in a new, half-length array, or overwrite the values in your existing array to save memory.)
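
A minimal sequential sketch of that pairwise (tree-shaped) reduction (the helper name is mine; each round halves the list, and the rounds are what you would hand out to worker threads):

    def pairwise_or(values):
        while len(values) > 1:
            # combine neighbours; an odd element at the end is carried over as-is
            values = [values[i] | values[i + 1] if i + 1 < len(values) else values[i]
                      for i in range(0, len(values), 2)]
        return values[0]

    # pairwise_or([0b1011001, 0b1001101, 0b1001010, 0b1010100]) -> 0b1011111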

Divide the work into chunks large enough to keep the whole machine busy without incurring too much overhead.

With enough processors, you can get close to logarithmic rather than linear complexity in the number of bit arrays. In practice, though, getting a 5x speedup on a 6-core machine is likely to be pretty good.

In some cases the optimal solution is to use prefix trees (tries); this has complexity O(n), where n is the sum of the lengths of the common prefixes of your binary strings, but with a large constant factor.
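
A rough sketch of the trie idea for bit strings (all names are mine, and the bitmaps are assumed to be given as strings such as "1011001"): build a binary trie, then descend while every level has exactly one branch and no string terminates.

    def longest_prefix_via_trie(bit_strings):
        root = {}
        for s in bit_strings:
            node = root
            for bit in s:
                node = node.setdefault(bit, {})
            node["$"] = True                    # mark the end of a string

        prefix = []
        node = root
        # descend while there is a single branch and no bitmap ends here
        while len(node) == 1 and "$" not in node:
            bit, node = next(iter(node.items()))
            prefix.append(bit)
        return "".join(prefix)

    # longest_prefix_via_trie(["1011001", "1001101", "1001010", "1010100"]) -> "10"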

Assuming you have input strings s1, s2, s3, ...

  • Let s = commonSubString(s1, s2)
  • For i = 3..n
    • s = commonSubString(s, si)
  • Return s

In the worst case, s1 and s2 could be identical, in which case you can use heuristics (for example, first cap the length of s at k = 100; if the final s still has length k = 100, repeat the whole process, but starting from position k + 1 of each string).
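
A minimal sketch of that pairwise reduction (the helper names are mine), using the bit strings from the question:

    def common_prefix(a, b):
        i = 0
        while i < len(a) and i < len(b) and a[i] == b[i]:
            i += 1
        return a[:i]

    def common_prefix_of_all(strings):
        s = strings[0]
        for t in strings[1:]:
            s = common_prefix(s, t)
            if not s:                     # nothing left in common, stop early
                break
        return s

    # common_prefix_of_all(["1011001", "1001101", "1001010", "1010100"]) -> "10"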
