Find the URL of the top k visits in the last day, or last hour, or last minute?

The original question is a file containing 5 GB of URLs visited on the last day to search for top-level URLs. The problem can be solved using a hash map to count the occurrences of different URLs and search for the vertex k using the min heap, taking into account the time O (n log k).

Now I think that if the input was an unlimited stream of online data (instead of a static file), then how to find the top k-address of the last day?

Or are there any improvements I can make for the system that allow me to get the maximum URL for the last minute and last day and last hours dynamically?

Any hint would be appreciated!

+6
source share
1 answer

If you are willing to agree to a probabilistic answer, which may contain several incorrect entries, you should definitely look at the count-min sketch . It was specifically designed to evaluate frequent elements in a stream using as little memory as possible, and most implementations support very efficient time and space for efficient use of the top k elements from the stream. In addition, the structure allows you to customize the use of space, which makes it ideal for such situations. IIRC Google uses this to determine the most common searches.

several implementations of this data structure are available on the Internet.

Hope this helps!

+1
source

All Articles