Implementing large-scale log file analytics

Can someone give me a link to, or a high-level overview of, how companies such as Facebook, Yahoo, and Google perform large-scale (e.g. multi-terabyte range) analysis of the logs they collect for operations and, especially, for web analytics?

Focusing on web analytics in particular, I am interested in two closely related aspects: query performance and data storage.

I know that the general approach is to use MapReduce to distribute each query across a cluster (e.g. using Hadoop), but what is the most efficient storage format? This is log data, so we can assume each event has a timestamp and that, in general, the data is structured and not sparse. Most web analytics queries involve analyzing a slice of the data between two arbitrary timestamps and extracting statistics or anomalies from that slice.
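For concreteness, the "slice by timestamp, then aggregate" pattern maps naturally onto a MapReduce job: the mapper filters events to the requested time window and the reducer aggregates. A minimal sketch in plain Python (the tab-separated log format is an illustrative assumption, not a standard):

```python
from datetime import datetime

# Hypothetical log line format: "<ISO-8601 timestamp>\t<url>\t<status>"
# (an assumption for illustration; real log schemas vary).

def parse_event(line):
    """Split a raw log line into (timestamp, url, status)."""
    ts, url, status = line.rstrip("\n").split("\t")
    return datetime.fromisoformat(ts), url, int(status)

def map_phase(lines, start, end):
    """Mapper: keep only events in [start, end) and emit (url, 1)."""
    for line in lines:
        ts, url, _status = parse_event(line)
        if start <= ts < end:
            yield url, 1

def reduce_phase(pairs):
    """Reducer: sum hit counts per url."""
    counts = {}
    for url, n in pairs:
        counts[url] = counts.get(url, 0) + n
    return counts
```

In a real cluster the framework handles the shuffle between the two phases; the point is that the timestamp filter happens in the mapper, so only the requested slice crosses the network.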

Would a column-oriented database, such as Bigtable (or HBase), be an efficient way to store and, more importantly, query such data? Does the fact that you are selecting a subset of rows (based on a timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g. an inverted index?
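For what it's worth, the usual way to make timestamp-range scans efficient in an HBase/Bigtable-style store is to encode the timestamp into the row key, optionally with a salt prefix so sequential writes don't all land on one region server. A minimal sketch, with the bucket count and key layout as illustrative assumptions:

```python
import zlib

NUM_SALT_BUCKETS = 8  # spreads time-ordered writes across region servers

def row_key(epoch_seconds, event_id):
    # Salt bucket + fixed-width timestamp + unique suffix; rows sort by
    # time within each bucket, avoiding a single "hot" region on writes.
    salt = zlib.crc32(event_id.encode()) % NUM_SALT_BUCKETS
    return f"{salt:02d}|{epoch_seconds:010d}|{event_id}"

def scan_ranges(start_epoch, end_epoch):
    # A [start, end) time-slice query becomes one short scan per bucket.
    return [(f"{s:02d}|{start_epoch:010d}|", f"{s:02d}|{end_epoch:010d}|")
            for s in range(NUM_SALT_BUCKETS)]
```

With this layout a timestamp-range query is not a full-table filter but a handful of contiguous scans, which is exactly the access pattern these stores are good at.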

3 answers

Unfortunately, there is no one-size-fits-all answer.

I currently use Cascading, Hadoop, S3, and Aster Data to process 100 gigs a day through a staged pipeline inside AWS.

Aster Data is used for queries and reporting, since it presents a SQL interface to the massive datasets that are cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is a fairly trivial process.

Keep in mind that tools such as HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins; instead, use MapReduce/Cascading to perform the joins out of band, which is a very useful pattern.
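The "out-of-band join" mentioned above is typically a reduce-side join: mappers tag records from each dataset, the shuffle groups them by join key, and the reducer combines them. A sketch in plain Python (field names are illustrative assumptions):

```python
from collections import defaultdict

def map_clicks(clicks):
    """Tag click records: (user_id, url) -> (user_id, ('click', url))."""
    for user_id, url in clicks:
        yield user_id, ("click", url)

def map_users(users):
    """Tag user records: (user_id, country) -> (user_id, ('user', country))."""
    for user_id, country in users:
        yield user_id, ("user", country)

def reduce_join(tagged_pairs):
    """Group tagged records by key, then pair each click with its user."""
    grouped = defaultdict(list)
    for key, value in tagged_pairs:
        grouped[key].append(value)
    joined = []
    for user_id, values in grouped.items():
        countries = [v for tag, v in values if tag == "user"]
        urls = [v for tag, v in values if tag == "click"]
        for country in countries:
            for url in urls:
                joined.append((user_id, country, url))
    return joined
```

In Hadoop the grouping step is the framework's shuffle rather than an in-memory dict, but the tag-group-combine structure is the same.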

In full disclosure, I am a developer on the Cascading project.

http://www.asterdata.com/

http://www.cascading.org/


The Hadoop book, O'Reilly's Hadoop: The Definitive Guide, has a chapter that discusses how Hadoop is used at two real-world companies.

http://my.safaribooksonline.com/9780596521974/ch14


Take a look at the Google paper "Interpreting the Data: Parallel Analysis with Sawzall". It describes the tool Google uses for log analysis.

