What is the most efficient bit vector compression method for my use case?


I am working on a project in computational biology, and I need to keep the locuses index, which differs between many sequences. Right now I'm using B + Tree for this purpose, but I suppose using the bitmap index would be faster for that use: only a small number of loci differ between the two sequences, on average 1%, and they are almost equally distributed along the sequence; therefore, it seems that there is a lot of room for raster index compression. My problem is that I cannot find a compression method that can efficiently:

  • allows you to quickly set / disable individual bits.
  • allow efficient range queries on a bitmap
  • possibly allows you to quickly perform XOR-ing / AND from two indexes

Thanks in advance for your suggestions.

+8
c bit-manipulation indexing bitmap compression
source share
2 answers
+2
source share

You can use a simple tree data structure, for example:

struct node { node * leftChild; node * rightChild; long mask; }; struct tree { int exponent; // the size of the tree is 2^exponent node rootNode; }; 

Each node is a submatrix of a large array of bits, which is (2 ^ n) * sizeof (long) bits, n> = 0. Leaf nodes store the raw bitmask in the mask if they are at the bottom of the tree, otherwise they save 0 in the mask . Thus, a node sheet with a “mask” value of 0 can represent a (2 ^ n) * sizeof (long) -dimensional empty area in the bitmap, so sparse bit arrays can be effectively stored.

leftChild and rightChild, of course, are null in all leaf nodes. Each other node has both a leftChild pointer and a rightChild pointer, and each node that is not a leaf node has at least one descendant node with a mask with the bits set in it.

To find a bit at a given index:

 bool find_bit_at_index(tree t, long ind) { long divider = 1 << (t.exponent - 1); node *n = &t.rootNode; node *lastNode; while (n) { lastNode = n; if (ind >= divider) { n = n->rightChild; ind -= divider; } else { n = n->leftChild; } divider >>= 1; } return lastNode->mask & (1 << ind); } 

Building a tree and developing the rest of the algorithms should be fairly simple as soon as you understand the idea. I really have not tested the code, as this is not a complete solution, some typos or some may remain. And I'm not a raster image index specialist, maybe (probably there is) a ready-made package that does it better, but this solution is simple and should be relatively effective. 1% is probably not yet sparse enough to do it better than a simple array of bits (assuming longs stores 64 bits each, no more than two lengths are required so that on average more than one bit is set), but if it decreases that space and time will be reduced.

0
source share

Source: https://habr.com/ru/post/651375/


All Articles