Data structure with O(1) insert time and O(log m) search?

Backstory (skip to the second paragraph if you just want the data-structure question): I am working on a compression algorithm (a variation of LZ77). The algorithm boils down to finding the longest match between a given string and all the strings that have already been seen.

To do this quickly, I use a hash table (with separate chaining), as recommended in the DEFLATE specification: I insert every string (one per byte of input), with m slots in the chain for each hash code. Inserts are fast (constant time, no conditional logic), but searches are slow because I have to examine O(m) strings to find the longest match. Since in a typical run I do hundreds of thousands of inserts and tens of thousands of searches, I need a highly efficient data structure if I want the algorithm to be fast (currently it is too slow for m > 4, and I need m closer to 128).
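
To make the setup concrete, here is a minimal sketch of the hash-chain scheme described above (the names `hash3`, `insert`, and `longest_match`, the 3-byte hash, and the chain limit of 4 are all illustrative assumptions, not taken from the question or the DEFLATE spec):

```python
CHAIN_LIMIT = 4  # the "m" from the question (illustrative)

def hash3(data, pos):
    """Hash the 3 bytes starting at pos (DEFLATE hashes a short prefix)."""
    return (data[pos] << 16 | data[pos + 1] << 8 | data[pos + 2]) % 4093

def insert(chains, data, pos):
    # O(1) insert: prepend this position to the chain for its hash code,
    # so the most recent (closest) positions are examined first.
    chains.setdefault(hash3(data, pos), []).insert(0, pos)

def longest_match(chains, data, pos):
    # O(m) search: scan up to CHAIN_LIMIT chained positions and keep
    # the one with the longest byte-wise match against data[pos:].
    best_pos, best_len = -1, 0
    for cand in chains.get(hash3(data, pos), [])[:CHAIN_LIMIT]:
        n = 0
        while pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_pos, best_len = cand, n
    return best_pos, best_len
```

For example, inserting positions 0..2 of `b"abcabcabcx"` and then searching at position 3 finds the match at position 0 with length 6 (LZ77 matches may overlap the current position).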

I implemented a special case for m = 1, which runs very fast but offers only so-so compression. Now I am working on an algorithm for users who prefer a better compression ratio over speed, where the larger m is, the better the compression (up to a point, obviously). Unfortunately, my attempts so far are still too slow for moderate compression ratios as m increases.

So, I am looking for a data structure that allows very fast inserts (since I do more inserts than searches) but still reasonably fast searches (better than O(m)). Is there a data structure with O(1) insertion and O(log m) search? Failing that, what would be the best data structure to use? I am willing to sacrifice memory for speed. I should add that on my target platform, jumps (ifs, loops, and function calls) are very slow, as are heap allocations (I have to implement everything myself on top of a raw byte array to get acceptable performance).

So far, I have considered storing the m strings in sorted order, which would allow O(log m) searches using binary search, but then the inserts also become O(log m).

Thanks!

+7
3 answers

You may be interested in this search structure:

http://encode.ru/threads/1393-A-proposed-new-fast-match-searching-structure

It has O(1) insertion time and O(m) search, but its m is many times smaller than a standard hash table's for equivalent match results. For example, with m = 4, this structure gets results equivalent to a hash table with 80 probes.

+3

You might want to use a trie (a.k.a. prefix tree) instead of a hash table.

For your specific application, you can optimize insertion further. If you know that after inserting ABC you are likely to insert ABCD, then keep a reference to the node created for ABC and simply extend it with D; there is no need to repeat the search for the prefix.
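
A minimal trie sketch illustrating that cached-node optimization (the class and function names are illustrative, not from any library):

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}  # byte value -> child TrieNode

def insert_from(node, suffix):
    """Insert `suffix` starting at `node` and return the final node.

    Keeping the node returned for ABC lets ABCD be inserted by
    extending with just D, instead of re-walking the whole prefix."""
    for byte in suffix:
        node = node.children.setdefault(byte, TrieNode())
    return node

root = TrieNode()
abc_node = insert_from(root, b"ABC")      # walks/creates A, B, C
abcd_node = insert_from(abc_node, b"D")   # one step, no prefix re-walk
```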

+1

One common optimization in hash tables is to move the item you just found to the head of its list (on the theory that it is likely to be looked up again soon in that bucket). Perhaps you can use a variation of this idea.
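
The move-to-front heuristic can be sketched in a few lines (a hypothetical helper, with the bucket as a plain list):

```python
def search_mtf(bucket, key_fn, target):
    """Linear search that moves the found item to the front of the
    bucket (move-to-front heuristic), so hot items are found quickly."""
    for i, item in enumerate(bucket):
        if key_fn(item) == target:
            bucket.insert(0, bucket.pop(i))  # promote to head
            return item
    return None
```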

If you do all your inserts before doing any searches, you can add a bit to each bucket that says whether the chain for that bucket is sorted. On each search, check the bit; if the bucket is not sorted, sort it and set the bit. Once a bucket has been sorted, each search in it is O(log m).
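
A sketch of that sort-on-first-search idea (class and method names are illustrative): each bucket carries a "sorted" flag, inserts stay O(1), and the first search pays the one-time sorting cost.

```python
import bisect

class LazyBucket:
    def __init__(self):
        self.items = []
        self.is_sorted = True

    def insert(self, item):
        # O(1): just append and mark the bucket dirty.
        self.items.append(item)
        self.is_sorted = False

    def contains(self, item):
        # First search sorts the chain once; later searches are O(log m).
        if not self.is_sorted:
            self.items.sort()
            self.is_sorted = True
        i = bisect.bisect_left(self.items, item)
        return i < len(self.items) and self.items[i] == item
```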

If you alternate between inserts and searches, you can keep two lists per bucket: one sorted and one unsorted. Inserts always go into the unsorted list. A search first checks the sorted list, and only if the item is not there does it scan the unsorted list. When an item is found in the unsorted list, you remove it from there and put it into the sorted list. That way you only pay to sort the items you actually look up.
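
The two-list bucket can be sketched as follows (again with illustrative names): inserts go to an unsorted overflow list, and a hit in that list promotes the item into the sorted list, so sorting cost is only paid for items that are actually searched.

```python
import bisect

class TwoListBucket:
    def __init__(self):
        self.sorted_items = []
        self.unsorted_items = []

    def insert(self, item):
        # O(1): always append to the unsorted overflow list.
        self.unsorted_items.append(item)

    def contains(self, item):
        # Fast path: O(log m) binary search in the sorted list.
        i = bisect.bisect_left(self.sorted_items, item)
        if i < len(self.sorted_items) and self.sorted_items[i] == item:
            return True
        # Slow path: linear scan of the overflow list; on a hit,
        # promote the item into the sorted list.
        if item in self.unsorted_items:
            self.unsorted_items.remove(item)
            bisect.insort(self.sorted_items, item)
            return True
        return False
```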

0
