Find the Top 10 most visited URLs, data is stored on the network

Source: Google Interview Question

Given the large network of computers, each of which stores the log files of the visited URLs, it finds the ten most visited URLs.

Have a lot of big <string (url) -> int (visits)> maps .

Calculate < string (url) -> int (sum of visits among all distributed maps) and get the top ten on the combined map.

Primary limitation: Maps are too large to transmit over the network. Also cannot use MapReduce directly.

Now I have a lot of questions of this type, where processiong needs to be done on large distributed systems. I cannot think or find a suitable answer.

All I could think of was brute force, which in one way or another violates this restriction.

+7
performance algorithm distributed-system large-data
source share
2 answers

It says that you cannot directly use map binding, which is a hint that the question author wants you to think about how the map works, so we just imitate map-reduce actions

  • preliminary processing: let R be the number of servers in the cluster, give each a unique server identifier from 0,1,2, ..., R-1
  • (map) For each (string, id) - send a tuple to the server with the identifier hash(string) % R
  • (reduce). As soon as step 2 is completed (simple control link), create (string,count) from the top 10 lines to the server. Note that tuples are sent in step 2 to this particular server.
  • (map) Each server will send all its 10-page servers (let it be server 0). This should be good, there are only 10 * R of these records.
  • (decrease) Server 0 will provide the top ten throughout the network.

Notes:

  • The problem with the algorithm, like most big data algorithms that do not use frameworks, is handling failed servers. MapReduce takes care of it for you.
  • The above algorithm can be transferred to a 2-phase algorithm with a decrease in the size of the card is quite straightforward.
+12
source share

In the worst case scenario, any algorithm that does not require the transmission of the entire frequency table will fail. We can create a trivial case when the global top 10 are at the bottom of each individual list of computers.

Assuming that the frequency of the URI complies with the Zipf law, we can find effective solutions. One such solution follows.

Each machine sends top-K items. K depends only on the available bandwidth. One master machine sums the frequencies and finds the 10th maximum frequency response β€œV10” (note that this is the lower limit. Since the global top 10 cannot be in the top K of each machine, the amount is not completed).

In the next step, each computer sends a list of URIs whose frequency is V10 / M (where M is the number of machines). The combination of all such devices is sent back to each machine. Each machine, in turn, sends back the frequency for this particular list. The wizard combines this list into a list from a list.

+3
source share

All Articles