On your approach
It scales well (linearly) in the number of bit arrays.
It does not scale very well by the size of bit arrays; ideally, it should scale depending on the length of the general prefix, and not on the size of the bit arrays.
Low
Bit operations on individual bytes / words from a bitmap should be much faster than going through bits one at a time. (Not sure how low a level of control Python can give you, though).
First sentence
If it were a low-level language such as C, I would solve it the same way as you, but with a few ideas from other answers.
In my example, I will assume that the computer is a 64-bit machine.
I start with (OR NAND XOR) only the first 64 bits of each bit array (these are the basic operations on a 64-bit machine, and complexity is only O (number of arrays)).
Then quickly find the position of the first set of bits as a result (most computers have some quick way of this built-in or at least optimized assembly code, for C , if there is a set bit, return this value.
Otherwise, repeat the next 64-127 bits of each bitmap.
(You will need to deal with bitmaps of different lengths, perhaps by looking for a bitmap of minimum length, and then using this as a maximum.)
The advantage of this approach is that it is linear in the number of bit arrays and the length of the common prefix is โโlinear.
Second sentence
If the number of bit arrays is large , you can get acceleration using parallelism.
First you can run OR at the same time as NAND.
With more ingenuity, you can apply the following:
If I have 4 bitmaps A, B, C, D
Instead of (((A OR B) OR C) OR D)
I could do (A OR B) OR (C OR D).
In both cases, the same amount of OR is executed.
But the second approach can be effectively parallelized (and, in fact, the second method can be faster on a single-core computer, since the processor will often have more than one ALU).
Writing the logic is a bit more complicated since you cannot use one for loop with one temporary variable holding the result of OR.
You can imagine saving subsamples in an array with a length of one half the number of bit arrays. Save the sub-result A OR B in array [0] and C OR D in array [1], then execute array [0] OR array [1]. (and you can save this result in a new half-length array or overwrite the values โโin your array to save space and memory).
Divide the work into large enough pieces to maintain the health of the entire computer without incurring too many extra costs.
With enough processors, you can get close to the log complexity of the number of bit arrays, not linear ones. But in practice, getting 5x acceleration on a 6-core machine is likely to be pretty good.