How can you quickly evaluate a large number of logical ANDs?

I need to run millions of queries of the following kind.

Each input consists of a small collection (< 100) of Boolean vectors of various sizes (< 20,000 elements), each of which has a few 1s and many 0s:

A = [ 0 0 0 1 0 0 0 0 0 0 0 ... ]
B = [ 0 0 0 0 1 0 ... ]
...

I also have many (> 20,000) Boolean AND-expressions. These expressions are constant across all queries.

 S[1] = A[10] AND B[52] AND F[15] AND U[2]
 S[2] = I[8] AND Z[4]
 ...

Each expression can refer to zero or one element from each vector. Variables rarely appear in multiple expressions. For each query, the output is a set of true expressions.

What is a good algorithm for computing the queries quickly, preferably faster than evaluating each expression in turn? The algorithm is run once per input, and there are millions of inputs, so speed is important. Since the expressions are constant, I can optimize them in advance. I am working in C.

+5
4 answers

You seem to be working with a lot of data. This is a guess, but I would say you will get the best behavior by pre-processing your expressions into cache-optimal passes. Consider the following two expressions:

 S[1] = A[10] AND B[52] AND F[15] AND U[2]
 S[2] = I[8] AND Z[4]

rewrite them as:

 S[1] = 1;
 S[1] &= A[10];
 S[1] &= B[52];
 S[1] &= F[15];
 S[1] &= U[2];
 S[2] = 1;
 S[2] &= I[8];
 S[2] &= Z[4];

Then sort all the expressions together to create one long list of operations:

 S[1] = 1;
 S[2] = 1;
 S[1] &= A[10];
 S[1] &= B[52];
 S[1] &= F[15];
 S[2] &= I[8];
 S[1] &= U[2];
 S[2] &= Z[8];
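A minimal sketch of what this flattened operation list could look like in C, assuming the input vectors are concatenated into one flat byte array (one byte per boolean) and each operation records a result index plus an offset into that array; the names op_t and run_ops are illustrative, not from the answer:

 #include <stdint.h>
 #include <string.h>

 /* One "S[i] &= V[j]" step: which result to update and where its input byte lives. */
 typedef struct {
     uint32_t s;    /* index into the result array S               */
     uint32_t in;   /* offset of the input boolean in a flat array */
 } op_t;

 /* Evaluate a precompiled, branch-free operation list.
  * input : all vectors (A, B, ...) concatenated, one byte per boolean
  * S     : result array, one byte per expression
  */
 static void run_ops(const op_t *ops, size_t n_ops,
                     const uint8_t *input, uint8_t *S, size_t n_exprs)
 {
     memset(S, 1, n_exprs);              /* S[i] = 1 for every expression */
     for (size_t i = 0; i < n_ops; i++)  /* straight-line, no early exit  */
         S[ops[i].s] &= input[ops[i].in];
 }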

Now consider the size of the machine's cache. Ideally all of the input vectors would fit in cache, but they probably won't, so input vectors and result vectors will be pulled into and out of memory several times. We want to divide the available cache into three parts: a block of input vectors, a block of result vectors, and some working space for the chunk of the operation list currently being executed.

Now walk the list of operations, pulling out those that fall within the ranges A-I and S[1]-S[400]. Then walk it again for J-T (or whatever fits in the cache), and keep pulling out operations until you reach the end; then repeat the whole pass over the operation list for S[401]-S[800], and so on. This is the final order of operations. Note that this can be parallelized with no contention across the S ranges.
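One way to produce that final order ahead of time, sketched under the assumption that operations are the op_t records from the previous sketch and that S_BLOCK and IN_BLOCK are tuning constants chosen to fit the cache (all of these names are illustrative): sort the operation list so that everything touching the same slice of S and the same slice of the input becomes contiguous.

 #include <stdint.h>
 #include <stdlib.h>

 typedef struct { uint32_t s, in; } op_t;   /* as in the previous sketch */

 enum { S_BLOCK = 400, IN_BLOCK = 4096 };   /* tune both to the cache size */

 /* Order operations by (result block, input block) so each block of work
  * touches only a cache-sized slice of S and of the input. */
 static int op_cmp(const void *pa, const void *pb)
 {
     const op_t *a = pa, *b = pb;
     uint32_t sa = a->s / S_BLOCK, sb = b->s / S_BLOCK;
     if (sa != sb)
         return sa < sb ? -1 : 1;
     uint32_t ia = a->in / IN_BLOCK, ib = b->in / IN_BLOCK;
     return (ia > ib) - (ia < ib);
 }

 /* Done once, up front: qsort(ops, n_ops, sizeof *ops, op_cmp); */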

The downside is that you get no early-out behavior. The upside is that cache misses occur only when transitioning from one block of computation to the next. For a dataset this large, I suspect that this (together with eliminating all branching) will outweigh the early-out advantage.

If you still want the early-out optimization, it is harder to implement, but possible. Consider: once you have A-I and S[1]-S[400] in cache and have built the list of operations for that bracket:

 S[1] &= A[10];
 S[1] &= B[52];
 S[1] &= F[15];
 S[2] &= I[8];

Then you can reorder the operations to group them by S[x] (which they already are in this example). Now, if you find that A[10] is false, you can skip straight ahead to the S[2] block. How do you implement this? Each operation now needs to know how far to skip forward from the current position:

 Operation[x]   => (S[1] &= A[10], on false, x += 3)
 Operation[x+1] => (S[1] &= B[52], on false, x += 2)
 Operation[x+2] => (S[1] &= F[15], on false, x += 1)
 Operation[x+3] => (S[2] &= I[8], ...
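A rough sketch of how those skip counts might be stored and executed, assuming each operation carries an extra skip field holding the distance to the next expression's first operation (the type and function names are illustrative, not from the answer):

 #include <stdint.h>
 #include <stddef.h>

 typedef struct {
     uint32_t s;     /* result index                                */
     uint32_t in;    /* offset of the input boolean                 */
     uint32_t skip;  /* distance to the next expression's first op  */
 } skip_op_t;

 /* S[] must be initialized to 1 for every expression before this runs. */
 static void run_ops_early_out(const skip_op_t *ops, size_t n_ops,
                               const uint8_t *input, uint8_t *S)
 {
     size_t i = 0;
     while (i < n_ops) {
         if (input[ops[i].in]) {
             i++;                  /* term is true: evaluate the next one  */
         } else {
             S[ops[i].s] = 0;      /* the whole AND is false               */
             i += ops[i].skip;     /* jump straight to the next expression */
         }
     }
 }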

Again, I suspect that merely adding the branching will be slower than simply doing all of the work. This also isn't a complete early-out: when you move on to the next input block, you will need to check each S[x] value in the current range to determine whether it has already gone false and should be skipped.

+4

Return early. Once you find a false boolean, you know the AND expression will evaluate to false, so don't check the rest.

In C, you get this behavior by default in hard-coded expressions, because the && operator short-circuits:

 (A[10] && B[52] && F[15] && U[2]) 

Depending on how predictable the input is, you may get more performance by ranking each variable in an expression by its probability of being false and reordering the expression so that the most-likely-false variables are checked first.
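If the expressions cannot all be hard-coded, a possible sketch of the same idea for data-driven expressions, assuming each expression is stored as an array of offsets into a flat input array and has been pre-sorted so the terms most likely to be false come first (the names expr_t and eval_expr are illustrative):

 #include <stdint.h>
 #include <stddef.h>

 /* One AND expression: the flat-array offsets of its terms, pre-sorted so
  * the term most likely to be false is checked first. */
 typedef struct {
     const uint32_t *terms;
     size_t          n_terms;
 } expr_t;

 static int eval_expr(const expr_t *e, const uint8_t *input)
 {
     for (size_t i = 0; i < e->n_terms; i++)
         if (!input[e->terms[i]])
             return 0;   /* short-circuit: one false term decides the result */
     return 1;
 }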

+5
  • Transform your input into a packed form (a list of the indices of the non-zero elements). To make the whole approach faster than evaluating each expression in turn, you want to process several elements at once, using bit twiddling or compiler built-ins (assuming each input boolean takes only one byte, or better still one bit).
  • Preprocess the AND expressions into index-mapping arrays that map each packed input index to the expression it belongs to. (If a variable appears in more than one expression, it will need special handling.)
  • Initialize a counter for each expression to 0.
  • Read the input arrays and increment the counters of the corresponding expressions.
  • Expressions whose counter equals their number of variables are "true"; the others are "false" (see the sketch after this list).
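A small sketch of this counting scheme, assuming each query arrives as a packed list of the indices of its set bits, and that a table built in the preprocessing step maps every input index to the single expression that uses it (-1 if none); the names below are illustrative, and the one-expression-per-variable assumption is exactly the special case the answer flags:

 #include <stdint.h>
 #include <string.h>

 /* Precomputed once from the expressions:
  *   expr_of[j] = expression that uses input bit j, or -1 if none
  *   n_terms[e] = number of variables in expression e
  */
 static void run_query(const uint32_t *set_bits, size_t n_set,
                       const int32_t *expr_of, const uint16_t *n_terms,
                       uint16_t *count, uint8_t *S, size_t n_exprs)
 {
     memset(count, 0, n_exprs * sizeof *count);
     for (size_t i = 0; i < n_set; i++) {        /* only the 1 bits matter */
         int32_t e = expr_of[set_bits[i]];
         if (e >= 0)
             count[e]++;
     }
     for (size_t e = 0; e < n_exprs; e++)        /* all terms seen => true */
         S[e] = (count[e] == n_terms[e]);
 }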
+2

I suggest you pre-process the expressions to create:

  • a list, for each variable, of the expressions containing that variable (i.e. the list for A[10] would be [S[1], plus any other expressions containing A[10]])
  • a counter, for each expression, of the number of variables in that expression (i.e. the counter for S[1] would be 4)

Then for each input:

  • Initialize the counter for each expression to the total number of variables in that expression
  • Iterate over the set bits of the sparse input and, for each one, decrement the counter of every expression containing that bit
  • Return all expressions whose counter has reached 0 (see the sketch after this list).
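A possible sketch of this scheme, assuming the per-variable expression lists are flattened into one index array with per-variable offsets (a CSR-style layout) and that each query arrives as a packed list of the indices of its set bits; all names here are illustrative:

 #include <stdint.h>
 #include <string.h>

 /* Precomputed once from the expressions:
  *   list_start[j] .. list_start[j+1] : range in expr_list holding the
  *                                      expressions that contain input bit j
  *   n_terms[e]                       : number of variables in expression e
  */
 static size_t run_query_countdown(const uint32_t *set_bits, size_t n_set,
                                   const uint32_t *list_start,
                                   const uint32_t *expr_list,
                                   const uint16_t *n_terms, uint16_t *count,
                                   uint32_t *true_exprs, size_t n_exprs)
 {
     /* 1. reset each counter to the expression's variable count */
     memcpy(count, n_terms, n_exprs * sizeof *count);

     /* 2. every set input bit satisfies one term in each expression using it */
     for (size_t i = 0; i < n_set; i++) {
         uint32_t j = set_bits[i];
         for (uint32_t k = list_start[j]; k < list_start[j + 1]; k++)
             count[expr_list[k]]--;
     }

     /* 3. expressions whose counter reached 0 are true */
     size_t n_true = 0;
     for (size_t e = 0; e < n_exprs; e++)
         if (count[e] == 0)
             true_exprs[n_true++] = (uint32_t)e;
     return n_true;
 }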
+1
