Since you are looking for what is least frequent and are willing to accept an approximate answer, you could use a series of Bloom filters instead of a hash table. If the filters are large enough, you should not need to worry about the query size, since you can probably keep the false positive rate low.
The idea is to go through all possible query sizes and generate the substrings of each size. For example, if queries are between 3 and 100 characters long, this costs roughly (N * (sum (i) from i = 3 to i = 100)), since there are about N substrings of each length i and each costs about i to hash. Then add the substrings one by one to one of the Bloom filters, choosing a filter that does not already contain the substring, and creating a new Bloom filter with the same hash functions if necessary. You obtain the count for a query by going through each filter and checking whether the query exists in it: every filter that contains it adds 1 to the count.
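The steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the names `BloomFilter` and `BloomCounter`, the filter size, the number of hashes, and the SHA-256 double-hashing scheme are all assumptions made for the example, and "add to the first filter that does not already contain the substring" is just one way to pick a filter.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


class BloomCounter:
    """Series of Bloom filters; count = number of filters containing a query."""

    def __init__(self, min_len=3, max_len=100, **filter_args):
        self.min_len = min_len
        self.max_len = max_len
        self.filter_args = filter_args
        self.filters = []

    def add_text(self, text):
        # Enumerate every substring whose length is a possible query size.
        longest = min(self.max_len, len(text))
        for length in range(self.min_len, longest + 1):
            for start in range(len(text) - length + 1):
                self._add(text[start:start + length])

    def _add(self, substring):
        # Greedy choice (an assumption): first filter that lacks the substring.
        for f in self.filters:
            if substring not in f:
                f.add(substring)
                return
        # Every existing filter already reports it, so open a new one.
        new_filter = BloomFilter(**self.filter_args)
        new_filter.add(substring)
        self.filters.append(new_filter)

    def count(self, query):
        # False positives can only inflate this, never lower it.
        return sum(1 for f in self.filters if query in f)


counter = BloomCounter(min_len=3, max_len=5)
counter.add_text("abcabcabd")
print(counter.count("abc"), counter.count("abd"))
```

Because a Bloom filter has no false negatives, the count returned here is never lower than the true number of occurrences; false positives can only push it higher.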
You need to balance the false positive rate against the number of filters. If the false positive rate gets too high on one of the filters, that filter is no longer useful; it is equally bad to end up with trillions of Bloom filters (which is quite possible if you have one filter per substring). There are a couple of ways to deal with these problems.
- To reduce the number of filters:
  - Randomly remove filters until only a set number remain. This will likely increase the false negative rate, so it is probably better to simply remove the filters with the highest expected false positive rate.
  - Randomly merge filters until only a set number remain. Ideally, avoid merging any one filter too often, since each merge increases its false positive rate. Practically speaking, you will probably have too many filters to do this without using the scalable version (see below), since controlling the false positive rate will probably be hard enough already.
- It may also not be a bad idea to avoid a greedy approach when adding to a Bloom filter. Be fairly selective about which filter an item is added to.
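As a rough sketch of the two reduction strategies, the example below stores each filter as a Python integer used as a bit set. Merging is a bitwise OR, which only works because all filters share the same size and hash functions. The helper names (`estimated_fp_rate`, `drop_fullest`, `merge_down`), the constants, and the estimate "fraction of set bits raised to the number of hashes" are assumptions for illustration. Note that `merge_down` merges the two least-full filters rather than picking at random, which is a swapped-in heuristic and not what the list above describes.

```python
import hashlib

NUM_BITS = 1 << 12   # illustrative filter size
NUM_HASHES = 4       # illustrative number of hash functions


def positions(item):
    # Same double-hashing idea for every filter, so filters stay mergeable.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def add(bits, item):
    # Filters are immutable ints here, so return the updated filter.
    for p in positions(item):
        bits |= 1 << p
    return bits


def contains(bits, item):
    return all((bits >> p) & 1 for p in positions(item))


def estimated_fp_rate(bits):
    # Approximation: chance that NUM_HASHES random positions are all set.
    fill = bin(bits).count("1") / NUM_BITS
    return fill ** NUM_HASHES


def drop_fullest(filters, max_filters):
    # Keep the filters with the lowest estimated false positive rate.
    return sorted(filters, key=estimated_fp_rate)[:max_filters]


def merge_down(filters, max_filters):
    # Repeatedly OR together the two least-full filters (a heuristic choice).
    filters = sorted(filters, key=estimated_fp_rate)
    while len(filters) > max_filters and len(filters) >= 2:
        merged = filters[0] | filters[1]
        filters = sorted(filters[2:] + [merged], key=estimated_fp_rate)
    return filters


f1 = add(0, "abc")
f2 = add(0, "xyz")
f3 = add(add(0, "abc"), "def")
print(len(merge_down([f1, f2, f3], 2)), len(drop_fullest([f1, f2, f3], 2)))
```

Both strategies change the counts: dropping a filter loses every occurrence recorded only there, and after a merge a substring that was present in both filters is counted once instead of twice.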
You may end up having to implement scalable Bloom filters to keep things manageable; that sounds similar to what I am suggesting anyway, so it should work well.