Best Practices for Log Analysis with Amazon Elastic MapReduce

I parse access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregate statistics for each delivered file by date / referrer / user agent.

Tons of logs are generated every hour, and that number could grow significantly in the near future, so processing this data in a distributed fashion with Amazon Elastic MapReduce sounds reasonable.

My mappers and reducers are now ready to process the data, and I have tested the whole pipeline with the following workflow (sketched in code below):

  • uploaded the mappers, reducers, and data to Amazon S3
  • configured the appropriate job flow and ran it successfully
  • downloaded the aggregated results from Amazon S3 to my server and inserted them into the MySQL database with a CLI script
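
In script form, that flow amounts to roughly the following. This is only a minimal sketch using the boto3 SDK; the bucket name, instance settings, and the load_into_mysql helper are hypothetical placeholders rather than my actual code:

    import boto3

    s3 = boto3.client("s3")
    emr = boto3.client("emr")

    BUCKET = "my-log-bucket"  # hypothetical bucket
    RUN = "2012-04-01"        # one batch of logs

    # 1. Upload the mapper, reducer, and raw logs to Amazon S3.
    for script in ("mapper.py", "reducer.py"):
        s3.upload_file(script, BUCKET, f"scripts/{script}")
    s3.upload_file("access.log", BUCKET, f"input/{RUN}/access.log")

    # 2. Start a Hadoop streaming job flow and wait for it to finish.
    resp = emr.run_job_flow(
        Name=f"log-aggregation-{RUN}",
        ReleaseLabel="emr-6.15.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "aggregate-access-logs",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", f"s3://{BUCKET}/scripts/mapper.py,s3://{BUCKET}/scripts/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", f"s3://{BUCKET}/input/{RUN}/",
                    "-output", f"s3://{BUCKET}/output/{RUN}/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    emr.get_waiter("cluster_terminated").wait(ClusterId=resp["JobFlowId"])

    # 3. Download the aggregated results and load them into MySQL.
    s3.download_file(BUCKET, f"output/{RUN}/part-00000", "results.tsv")
    load_into_mysql("results.tsv")  # hypothetical: the existing CLI loader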

I did all of this manually, following one of the thousands of Amazon EMR tutorials you can find on the Internet.

What should I do next? What is the best approach to automating this process?

What are common practices for:

  • Using cron to drive Amazon EMR job flows via the API?
  • How can I make sure my logs will not be processed twice? (one pattern is sketched after this list)
  • Should I manage moving / deleting processed input files and result files with my own script?
  • What is the best approach to processing the results before inserting them into PostgreSQL / MySQL?
  • Should I create separate "input" / "output" directories for each job, or reuse the same directories for all jobs?
  • Should I create a new job flow each time through the API?
  • What is the best way to upload the raw logs to Amazon S3? I looked at Apache Flume, but I'm not sure it is what I need, since I don't require real-time log processing.
  • How do you detect that a new chunk of Apache / Nginx logs is ready to be uploaded to Amazon? (log rotation?)
  • Can anyone share their data flow setup?
  • How do you monitor file uploads and handle failures?
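
To make the cron and double-processing questions concrete, here is the kind of launcher I have in mind: cron invokes a script that "claims" freshly uploaded logs by moving them into a per-run input prefix before starting the job, so the same object can never be picked up by two runs. Again a boto3 sketch; the bucket name and the incoming/ and input/ prefix layout are hypothetical:

    import datetime
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-log-bucket"  # hypothetical

    def claim_new_logs():
        """Move freshly uploaded logs into a per-run input prefix."""
        run_id = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
        listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="incoming/")
        keys = [obj["Key"] for obj in listing.get("Contents", [])]
        for key in keys:
            dest = key.replace("incoming/", f"input/{run_id}/", 1)
            s3.copy_object(Bucket=BUCKET, Key=dest,
                           CopySource={"Bucket": BUCKET, "Key": key})
            s3.delete_object(Bucket=BUCKET, Key=key)  # claimed: gone from incoming/
        return run_id if keys else None

    run_id = claim_new_logs()
    if run_id:
        # Launch the EMR job flow with:
        #   input  = s3://my-log-bucket/input/<run_id>/
        #   output = s3://my-log-bucket/output/<run_id>/
        # (see the job-flow sketch above)
        pass

A crontab entry such as */30 * * * * python /opt/launch_emr.py would then drive the pipeline; whether a queue-based trigger is the better common practice is part of what I'm asking.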

Of course, in most cases it depends on your infrastructure and application architecture.

Of course, I could implement this with my own custom solution, probably reinventing a lot of things that others already use.

But there must be some common practices here, and I would like to familiarize myself with them.

I think this topic could be useful for the many people trying to process access logs with Amazon Elastic MapReduce who have not been able to find good material on best practices.

UPD: To clarify, the question is:

What are the best practices for processing logs with Amazon Elastic MapReduce?

Related posts:

Retrieving data to and from Elastic MapReduce HDFS

logging amazon-s3 elastic-map-reduce hadoop hadoop-streaming
1 answer

This is a very broad open-ended question, but here are some thoughts you might consider:

  • Using Amazon SQS: SQS is a distributed queue and is very useful for managing workflows. One process writes a message to the queue as soon as a log is available; another reads the message, processes the log it describes, and deletes the message once processing is done. This keeps each log from being processed twice (a minimal sketch follows this list).
  • Apache Flume, as you mentioned, is very useful for log aggregation. It is worth considering even if you do not need real-time delivery, since it at least gives you a standardized aggregation process.
  • Amazon recently released SimpleWorkflow (SWF). I have only just started looking into it, but it sounds promising for managing every step of your data pipeline.
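
Here is a minimal sketch of that SQS pattern using the boto3 SDK; the queue URL, message format, and run_emr_job helper are hypothetical:

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-jobs"  # hypothetical

    # Producer: announce a log as soon as it lands in S3.
    def announce_log(bucket, key):
        sqs.send_message(QueueUrl=QUEUE_URL,
                         MessageBody=json.dumps({"bucket": bucket, "key": key}))

    # Consumer: process the log described by a message, and delete the
    # message only after processing succeeds.
    def process_next():
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            run_emr_job(body["bucket"], body["key"])  # hypothetical launcher
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])

One caveat: SQS is at-least-once delivery, so if a consumer dies mid-job the message reappears after the visibility timeout. Keep the processing step idempotent, for example by writing to a run-specific output prefix.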

Hope this gives you some pointers.

