Can someone give me a link to, or provide a high-level overview of, how companies such as Facebook, Yahoo, Google, etc. perform large-scale (e.g., multi-terabyte) analysis of the log data they collect, both for operations and especially for web analytics?
Focusing on web analytics in particular, I am interested in two closely related aspects: query performance and data storage.
I know that a common approach is to use MapReduce to distribute each query across a cluster (e.g., using Hadoop). However, what is the most efficient storage format to use? This is log data, so we can assume each event has a timestamp, and that in general the data is structured and not sparse. Most web-analytics queries involve analyzing slices of the data between two arbitrary timestamps and extracting statistics or anomalies from them.
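To make the access pattern concrete, here is a minimal sketch of the brute-force approach I have in mind: a Hadoop mapper that scans raw log lines and keeps only events inside the query window. The log layout (an epoch-millis timestamp as the first tab-separated field) and the configuration key names are assumptions of mine purely for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative time-slice filter over raw log lines.
 * Assumes each line starts with an epoch-millis timestamp followed by a tab.
 */
public class TimeSliceMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private long startTs;
    private long endTs;

    @Override
    protected void setup(Context context) {
        // Query bounds passed in via the job configuration (key names are made up).
        startTs = context.getConfiguration().getLong("query.start.ts", Long.MIN_VALUE);
        endTs   = context.getConfiguration().getLong("query.end.ts", Long.MAX_VALUE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return; // skip malformed lines
        }
        long ts;
        try {
            ts = Long.parseLong(line.substring(0, tab));
        } catch (NumberFormatException e) {
            return; // skip lines without a numeric timestamp
        }
        if (ts >= startTs && ts < endTs) {
            // Emit only events inside the requested slice; reducers would then
            // compute the per-slice statistics or look for anomalies.
            context.write(new LongWritable(ts), value);
        }
    }
}
```

The obvious drawback is that every query still scans the full data set, which is why I am asking whether the storage format itself can do better.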
Would it be efficient to store and, more importantly, query such data in a column-oriented database such as BigTable (or HBase)? Does the fact that you are selecting a subset of rows (based on a timestamp) work against the basic premise of this kind of storage? Or would it be better to store it as unstructured data, e.g. in an inverted index?
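For clarity, here is a rough sketch of the column-store alternative I am asking about: events keyed in HBase by their timestamp, so that a time-range query becomes one contiguous row scan instead of a full pass over the data. The table name, the key layout (big-endian 8-byte timestamp), and the use of the HBase 2.x client API are my own assumptions for illustration, not a claim about how anyone actually does this.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Illustrative time-range query against a hypothetical "web_events" table
 * whose row keys are big-endian 8-byte timestamps, so lexicographic row
 * order equals chronological order.
 */
public class TimeRangeScan {
    public static void main(String[] args) throws IOException {
        long startTs = Long.parseLong(args[0]);
        long endTs   = Long.parseLong(args[1]);

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table events = conn.getTable(TableName.valueOf("web_events"))) {

            // The scan touches only the regions covering [startTs, endTs).
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes(startTs))
                    .withStopRow(Bytes.toBytes(endTs));

            long count = 0;
            try (ResultScanner scanner = events.getScanner(scan)) {
                for (Result row : scanner) {
                    count++; // aggregate per-slice statistics here instead
                }
            }
            System.out.println(count + " events in the requested slice");
        }
    }
}
```

My uncertainty is whether this row-range style of access plays to the strengths of a column-oriented store, or whether it defeats the point of that layout.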