How many times does a string appear within another string

I have a large static binary (10 GB) that does not change.

I want to be able to take small strings as input (each 15 bytes or fewer) and then determine which string is the least frequent.

I understand that without actually scanning the entire binary I cannot determine this exactly, so I know it will be an approximation.

Building a tree / hash table is not possible, since it would require about 256^15 bytes, which is a LOT.

I have about 100 GB of disk space and 8 GB of RAM that can be dedicated to this task, but I can't see a way to do it without going through the whole file.

I have as much time as I want to preprocess the large binary, and after that I will need to decide which string is the least frequent, many times over.

Any ideas?

Thanks! Daniel

(By the way, in case it matters: I'm using Python.)

4 answers

Maybe build a hash table with counts for as many n-grams as you can afford? You can prune away the ones that never appear. I wouldn't call it an "approximation"; rather an "upper bound", with a guarantee of spotting strings that do not appear at all.

So, let's say you can count all 4-grams.

Then, to bound the count for "ABCDEF", you take min(count(ABCD), count(BCDE), count(CDEF)). If that value is zero for any of them, the string does not appear at all. If it is one, the string appears at most once (but perhaps not at all).
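For illustration, here is a minimal Python sketch of that idea, assuming a file called big.bin and n = 4 (both placeholders). A plain Python dict over every 4-gram of a 10 GB file would be slow and memory-hungry, so a real run would probably use a flat array of 2^32 small counters and a faster inner loop instead:

    from collections import Counter

    N = 4                 # n-gram length (illustrative)
    CHUNK = 1 << 20       # read 1 MiB at a time

    def build_ngram_counts(path, n=N):
        counts = Counter()
        tail = b""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                data = tail + chunk
                # count every n-byte window; the carried tail covers windows
                # that span chunk boundaries without double counting
                for i in range(len(data) - n + 1):
                    counts[data[i:i + n]] += 1
                tail = data[-(n - 1):]
        return counts

    def upper_bound(counts, query, n=N):
        # the true number of occurrences of `query` (len(query) >= n) is at most
        # the minimum count over all of its n-byte windows; zero means it
        # cannot appear at all
        return min(counts[query[i:i + n]] for i in range(len(query) - n + 1))

    # counts = build_ngram_counts("big.bin")    # one-off preprocessing
    # print(upper_bound(counts, b"ABCDEF"))     # cheap per query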


Since you have a large static string that does not change, you can separate the one-time preprocessing of this string, which never has to be repeated, from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.

If you can find a machine with an order of magnitude or so more internal storage, you could build a suffix array - an array of offsets into the data, in sorted order of the suffixes starting at each offset. This could be stored in external storage for queries, and you could use it with binary search to find the first and last positions in sorted order where the query string appears. Obviously the distance between the two will give you the number of occurrences, and the binary search will take about 34 probes to cover 16 GB (taking 16 GB as 2^34), so each query, needing two binary searches, should cost about 68 disk accesses.

It may not be reasonable to expect to find that much internal storage, but I just bought a 1 TB hard drive for about 50 pounds, so I think you could afford to increase external storage for a one-off job. There are algorithms for constructing a suffix array in external memory, but since your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200 GB of data by writing out the 15-byte string found at each offset, followed by a 5-byte offset number, then sort those 20-byte records with an external sort. That gives you about 50 GB of offsets into the data in sorted order (roughly 10 billion offsets at 5 bytes each), which you can put in external storage and use to answer queries.
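For the query side, a possible Python sketch, assuming the index was built exactly as described above: fixed 20-byte records, each a 15-byte key followed by a 5-byte offset, sorted by key. The file name suffixes.sorted and the zero/0xff padding convention are my own assumptions:

    import os

    REC = 20   # record size: 15-byte key + 5-byte offset
    KEY = 15

    def read_key(f, i):
        f.seek(i * REC)
        return f.read(KEY)

    def lower_bound(f, nrec, key):
        # index of the first record whose key is >= key
        lo, hi = 0, nrec
        while lo < hi:
            mid = (lo + hi) // 2
            if read_key(f, mid) < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def upper_bound(f, nrec, key):
        # index of the first record whose key is > key
        lo, hi = 0, nrec
        while lo < hi:
            mid = (lo + hi) // 2
            if read_key(f, mid) <= key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def count_occurrences(index_path, query):
        # every record starting with `query` sorts between the query padded
        # with 0x00 bytes and the query padded with 0xff bytes
        lo_key = query.ljust(KEY, b"\x00")
        hi_key = query.ljust(KEY, b"\xff")
        with open(index_path, "rb") as f:
            nrec = os.path.getsize(index_path) // REC
            return upper_bound(f, nrec, hi_key) - lower_bound(f, nrec, lo_key)

    # print(count_occurrences("suffixes.sorted", b"ABCDEF"))

Each key comparison is a seek into the index, which is where the roughly 68 disk accesses per query come from.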


If you know all the queries in advance, or are prepared to batch them up, another approach would be to build an http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm tree from them. This takes time linear in the total size of the queries. Then you can stream through the 10 GB of data in time proportional to the sum of the size of that data and the number of times any string finds a match.
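A compact pure-Python sketch of that approach: a byte-level Aho-Corasick automaton streamed over the file. The function names and the chunked read are my own choices, and in practice a C-backed implementation would be much faster than this inner loop:

    from collections import deque

    def build_automaton(queries):
        # trie transitions, failure links, and per-state lists of matched query indices
        goto = [{}]
        fail = [0]
        out = [[]]
        for qi, q in enumerate(queries):
            state = 0
            for b in q:
                nxt = goto[state].get(b)
                if nxt is None:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    nxt = len(goto) - 1
                    goto[state][b] = nxt
                state = nxt
            out[state].append(qi)
        # breadth-first pass to compute failure links
        queue = deque(goto[0].values())
        while queue:
            r = queue.popleft()
            for b, s in goto[r].items():
                queue.append(s)
                f = fail[r]
                while f and b not in goto[f]:
                    f = fail[f]
                fail[s] = goto[f].get(b, 0)
                out[s].extend(out[fail[s]])
        return goto, fail, out

    def count_matches(path, queries, chunk_size=1 << 20):
        goto, fail, out = build_automaton(queries)
        counts = [0] * len(queries)
        state = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                for b in chunk:          # iterating bytes yields ints
                    while state and b not in goto[state]:
                        state = fail[state]
                    state = goto[state].get(b, 0)
                    for qi in out[state]:
                        counts[qi] += 1
        return counts

    # queries = [b"ABCDEF", b"\x00\x01\x02"]    # illustrative
    # print(count_matches("big.bin", queries))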


Because you are looking for which string is the least frequent, and are willing to accept an approximate answer, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you don't need to worry about the query size, since you can probably keep the false positive rate reasonable.

The idea would be to go through all of the possible query sizes and make substrings of those lengths out of the data. For example, if the query lengths range from 3 to 100, that would cost about N * (sum of i from i = 3 to i = 100) substrings. Then add each substring, one by one, to the first of the Bloom filters in which it does not already exist, creating a new Bloom filter with the same hash functions if necessary. You obtain the count by going through each filter and checking whether the query exists in it: each query simply walks through the filters and checks if it's there; if it is, it adds 1 to its count.
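A rough sketch of that scheme, using a small home-made Bloom filter. The filter size, hash count, and the trick of deriving the bit positions from a single SHA-256 digest are illustrative choices, not part of the original answer:

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 27, num_hashes=5):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # derive k bit positions from one digest via double hashing
            h = hashlib.sha256(item).digest()
            a = int.from_bytes(h[:8], "big")
            b = int.from_bytes(h[8:16], "big")
            return [(a + i * b) % self.size for i in range(self.k)]

        def add(self, item):
            for p in self._positions(item):
                self.bits[p >> 3] |= 1 << (p & 7)

        def __contains__(self, item):
            return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def add_substring(filters, sub):
        # insert into the first filter that does not already contain the substring,
        # so filter k roughly means "seen at least k+1 times"
        for f in filters:
            if sub not in f:
                f.add(sub)
                return
        filters.append(BloomFilter())
        filters[-1].add(sub)

    def approx_count(filters, query):
        # the approximate count is how many consecutive filters contain the query
        n = 0
        for f in filters:
            if query in f:
                n += 1
            else:
                break
        return n

    # filters = [BloomFilter()]
    # for sub in all_substrings(data, 3, 15):   # hypothetical substring generator
    #     add_substring(filters, sub)
    # print(approx_count(filters, b"ABCDEF"))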

You will need to balance the false positive rate against the number of filters. If the false positive rate gets too high on one of the filters it isn't useful; it's also bad to have trillions of Bloom filters (which is quite possible if you end up with one filter per count level of a substring). There are a couple of ways to deal with these problems.

  • To reduce the number of filters:
    • Randomly remove filters until only so many are left. This will probably increase the false negative rate, which likely means it's better to simply remove the filters with the highest expected false positive rates.
    • Randomly merge filters until only so many are left. Ideally avoid merging filters too often, as this increases the false positive rate. Practically speaking, you probably have too many filters to do this without the scalable version (see below), since it will be quite difficult to keep the false positive rate under control.
    • It may also not be a bad idea to avoid a greedy approach when adding substrings to the filters. Be rather selective about which filter each one is added to.

You may end up needing to implement a scalable Bloom filter to keep things manageable, which sounds similar to what I'm suggesting anyway, so it should work well.

