You seem to be using a lot of data. This is an assumption, but I would say that you get optimal behavior by pre-processing your expressions in optimal cache passes. Consider the following two expressions:
S[1] = A[10] AND B[52] AND F[15] AND U[2] S[2] = I[8] AND Z[4]
rewrite them as:
S[1] = 1; S[1] &= A[10]; S[1] &= B[52]; S[1] &= F[15]; S[1] &= U[2]; S[2] = 1; S[2] &= I[8]; S[2] &= Z[4];
Then sort all the expressions together to create one long list of operations:
S[1] = 1; S[2] = 1; S[1] &= A[10]; S[1] &= B[52]; S[1] &= F[15]; S[2] &= I[8]; S[1] &= U[2]; S[2] &= Z[4];
Consider the size of the machine cache. We need all the input vectors in the cache. This probably cannot be, so we know that we will pull input vectors and result vectors into and out of memory several times. We want to divide the available computer cache into three parts: a vector input block, a vector result block, and some workspace (from which the current current list of operations will be derived).
Now go through the list of expressions pulling expressions that fall into the range of AI and S [1] -S [400]. Then run JT again (or whatever fits in the cache) and pull these operations further as soon as you get to the end of repeating the list of operations for s [401] -s [800]. This is the final order of operations. Please note that this can be parallelized without competition over S-bands.
The downside is that you are not getting early behavior. Growth potential - this is only a cache failure during the transition of blocks of calculations. For such a large dataset, I suspect that this (and eliminating all branching) will overwhelm the early advantage.
If you still want to use early optimization, you can implement it harder. Think about it: as soon as you have the cache of AI and S [1] -s [400], and you have created a list of operations in this bracket:
S[1] &= A[10]; S[1] &= B[52]; S[1] &= F[15]; S[2] &= I[8];
Then you can reorder the operations to group them by S [x] (which was already in this example). Now, if you find that A [10] is false, you can exit early on block S [2]. How to implement this? Well, now your operations should know how much to skip forward from the current operation:
Operation[x ] => (S[1] &= A[10], on false, x+=3) Operation[x+1] => (S[1] &= B[52], on false, x+=2) Operation[x+2] => (S[1] &= F[15], on false, x+=1) Operation[x+3] => (S[2] &= I[8]...
Again, I suspect that simply adding branching will be slower than just doing all the other work. This is not a complete process from the very beginning, as you go to the next block of the input block, you will need to review each available S [x] value to determine if it has already been executed and should be skipped.