Source: Google Interview Question
Given the large network of computers, each of which stores the log files of the visited URLs, it finds the ten most visited URLs.
Have a lot of big <string (url) -> int (visits)> maps .
Calculate < string (url) -> int (sum of visits among all distributed maps) and get the top ten on the combined map.
Primary limitation: Maps are too large to transmit over the network. Also cannot use MapReduce directly.
Now I have a lot of questions of this type, where processiong needs to be done on large distributed systems. I cannot think or find a suitable answer.
All I could think of was brute force, which in one way or another violates this restriction.
performance algorithm distributed-system large-data
Spandan
source share