I parse the access logs produced by Apache, nginx, and Darwin (a video streaming server) and aggregate statistics for each delivered file by date / referrer / user agent.
Tons of logs are generated every hour, and that volume may grow significantly in the near future, so processing the data in a distributed fashion via Amazon Elastic MapReduce seems reasonable.
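To make the aggregation concrete, here is a simplified sketch of the kind of streaming mapper / reducer pair I mean (it assumes Apache combined log format and tab-separated output; a real parser would be stricter):

```python
#!/usr/bin/env python
# mapper.py -- parse combined-log-format lines and emit one record
# per (file, date, referrer, user-agent) combination.
import re
import sys

# Rough combined-log-format pattern; a production parser would be stricter.
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" \d+ \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

for line in sys.stdin:
    m = LINE_RE.search(line)
    if m:
        key = '\t'.join((m.group('path'), m.group('date'),
                         m.group('referrer'), m.group('agent')))
        print('%s\t1' % key)
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each key; Hadoop streaming delivers
# input sorted by key, so a simple running total is enough.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, _, count = line.rstrip('\n').rpartition('\t')
    if key != current_key:
        if current_key is not None:
            print('%s\t%d' % (current_key, total))
        current_key, total = key, 0
    total += int(count)
if current_key is not None:
    print('%s\t%d' % (current_key, total))
```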
My mapper and reducer are ready, and I have tested the whole process with the following workflow:
- uploaded the mapper, the reducer, and the data to Amazon S3
- set up the appropriate job flow and ran it successfully (to automate this step I'm considering driving it through the API; see the sketch after this list)
- downloaded the aggregated results from Amazon S3 to my server and inserted them into a MySQL database with a CLI script
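For the job-flow step, a minimal sketch of the kind of API call I'm considering, using boto3 (the bucket name, paths, release label, and instance settings are placeholders; the IAM roles are the stock EMR defaults):

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

BUCKET = 's3://my-log-bucket'  # hypothetical bucket

response = emr.run_job_flow(
    Name='access-log-aggregation',
    ReleaseLabel='emr-6.15.0',
    LogUri=BUCKET + '/emr-logs/',
    ServiceRole='EMR_DefaultRole',
    JobFlowRole='EMR_EC2_DefaultRole',
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': False,  # terminate when the step ends
    },
    Steps=[{
        'Name': 'aggregate-access-logs',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', BUCKET + '/scripts/mapper.py,'
                        + BUCKET + '/scripts/reducer.py',
                '-mapper', 'python3 mapper.py',
                '-reducer', 'python3 reducer.py',
                '-input', BUCKET + '/input/2024-01-01/',
                '-output', BUCKET + '/output/2024-01-01/',
            ],
        },
    }],
)
print('Started cluster %s' % response['JobFlowId'])
```

A cron entry would then simply run this script once an hour, after the log upload script has finished.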
So far I have done all of this manually, following one of the thousands of Amazon EMR tutorials you can find on the Internet.
What should I do next? What is the best approach to automate this process?
What are common practices for:
- Using cron to control Amazon EMR job flows via the API?
- How can I make sure my logs will not be processed twice?
- Should I manage the moving / deletion of processed input and result files with my own script? (the first sketch below shows what I have in mind)
- What is the best approach for parsing the results and inserting them into PostgreSQL / MySQL? (the second sketch below)
- Should I create different "input" / "output" directories for each job or use the same directories for all jobs?
- Should I create a new job flow through the API every time?
- What is the best way to upload the raw logs to Amazon S3? I looked at Apache Flume, but I'm not sure it's what I need, since I don't require real-time log processing. (the third sketch below shows the simple cron approach I have in mind)
- How do you detect that a new chunk of Apache / nginx logs is ready to be uploaded to Amazon? (log rotation?)
- Can anyone share their data flow setup?
- How do you monitor file uploads and job completion?
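To make a couple of these questions concrete: for avoiding double processing, the approach I have in mind is to move each job's input files under a processed/ prefix once the job succeeds, along these lines (the bucket name and prefixes are hypothetical):

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-log-bucket'  # hypothetical bucket name

def mark_processed(prefix='input/', done_prefix='processed/'):
    """Move every object under `prefix` to `done_prefix` so the next
    job run never sees, and never re-processes, the same log files."""
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            new_key = done_prefix + key[len(prefix):]
            s3.copy_object(Bucket=BUCKET, Key=new_key,
                           CopySource={'Bucket': BUCKET, 'Key': key})
            s3.delete_object(Bucket=BUCKET, Key=key)
```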
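For loading the results into MySQL, a sketch of what I have in mind (assuming tab-separated part-* files as produced by the reducer above, the PyMySQL driver, and a made-up file_stats table):

```python
import boto3
import pymysql

s3 = boto3.client('s3')
BUCKET = 'my-log-bucket'  # hypothetical
db = pymysql.connect(host='localhost', user='stats',
                     password='secret', database='stats')

rows = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix='output/2024-01-01/part-'):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body']
        for line in body.iter_lines():
            if not line:
                continue
            # reducer output: path \t date \t referrer \t agent \t count
            path, date, referrer, agent, count = line.decode('utf-8').split('\t')
            rows.append((path, date, referrer, agent, int(count)))

with db.cursor() as cur:
    cur.executemany(
        'INSERT INTO file_stats (path, day, referrer, user_agent, hits) '
        'VALUES (%s, %s, %s, %s, %s) '
        'ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)',
        rows)
db.commit()
```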
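And for getting raw logs into S3 without something as heavy as Flume, the cron-driven uploader I'm picturing only touches rotated files, since the live log is still being written to (the log path and naming scheme are assumptions based on default logrotate output):

```python
import glob
import os

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'my-log-bucket'  # hypothetical

# Only rotated files: the live access.log is still being appended to.
for path in glob.glob('/var/log/nginx/access.log-*'):
    key = 'input/' + os.path.basename(path)
    try:
        s3.head_object(Bucket=BUCKET, Key=key)   # already uploaded?
        continue
    except ClientError as err:
        if err.response['Error']['Code'] != '404':
            raise
    s3.upload_file(path, BUCKET, key)
```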
Of course, in most cases it depends on your infrastructure and application architecture.
Of course, I could implement all of this with my own custom solution, probably reinventing a lot of wheels that others are already using.
But there must be some common practices, and those are what I'd like to learn about.
I think this topic can be useful for the many people who try to process access logs with Amazon Elastic MapReduce but cannot find good material on how best to handle it.
UPD: Just for clarification, the question is:
What are the best practices for processing logs with Amazon Elastic MapReduce?
Related posts:
Retrieving data to and from Elastic MapReduce HDFS