AWS DynamoDB and MapReduce in Java

I have a huge DynamoDB table that I want to analyze in order to aggregate the data stored in its attributes. The aggregated data would then be processed by a Java application. Although I understand the basic concepts of MapReduce, I have never used it before.

In my case, let's say that each DynamoDB item has the attributes customerId and orderNumbers, and that there can be more than one item for the same customer. Like this:

customerId: 1, orderNumbers: 2
customerId: 1, orderNumbers: 6
customerId: 2, orderNumbers: -1

Basically, I want to sum the orderNumbers for each customerId, and then run some Java operations on the aggregates.
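
From what I have read so far, I imagine the map and reduce sides would look roughly like this. This is only a sketch based on my current understanding: it assumes the table data ends up on S3 as tab-separated "customerId<TAB>orderNumbers" text lines, which may well be wrong:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderAggregation {

    // Map: emit (customerId, orderNumbers) for each input line.
    public static class OrderMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: "customerId<TAB>orderNumbers" per line.
            String[] fields = line.toString().split("\t");
            if (fields.length == 2) {
                context.write(new Text(fields[0]),
                        new LongWritable(Long.parseLong(fields[1])));
            }
        }
    }

    // Reduce: sum all orderNumbers values seen for one customerId.
    public static class OrderReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text customerId, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(customerId, new LongWritable(sum));
        }
    }
}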

AWS Elastic MapReduce could probably help me, but I don't understand how to connect a custom JAR to DynamoDB. My custom JAR probably needs to expose map and reduce functions; where can I find the right interface to implement?
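
For completeness, this is how I would expect to wire those classes together in the JAR's main class, extending org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapreduce.Reducer as above (again just a guess on my part; the input/output paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderAggregationJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sum-orders-per-customer");
        job.setJarByClass(OrderAggregationJob.class);

        job.setMapperClass(OrderAggregation.OrderMapper.class);
        job.setCombinerClass(OrderAggregation.OrderReducer.class);
        job.setReducerClass(OrderAggregation.OrderReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // On EMR these would be s3:// paths, e.g. the exported data
        // and a result folder.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

I set the reducer as the combiner too, since sums can be partially aggregated on the map side (if I understood that correctly).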

Also, I'm a bit confused by the docs: it seems that I must first export my data to S3 before launching my JAR. Is that correct?

thanks

2 answers

Note: I have not actually created a working EMR job myself; I have only read about it.

First of all, read the Prerequisites for Integrating Amazon EMR with Amazon DynamoDB.

You can work directly on DynamoDB: see Sample Commands for Exporting, Importing, and Querying Data in Amazon DynamoDB. As you can see, you can execute SQL-like queries against the table.

If you have zero knowledge of Hadoop, you should probably read some reference materials, such as: What is Hadoop

This tutorial is another good read: Using Amazon Elastic MapReduce with DynamoDB.

As for your custom JAR application, you need to upload it to S3 first. Then use this guide: How to Create a Job Flow Using a Custom JAR.
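
If you would rather script that part than click through the console, the AWS SDK for Java can also create the job flow. A rough sketch (remember I have not actually run this; the bucket names, credentials and instance settings are made up):

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // The custom JAR step: the JAR on S3, its main class,
        // and the job's input/output arguments.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://my-bucket/order-aggregation.jar")
                .withMainClass("OrderAggregationJob")
                .withArgs("s3://my-bucket/export/", "s3://my-bucket/output/");

        StepConfig step = new StepConfig("sum-orders", jarStep)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("order-aggregation")
                .withSteps(step)
                .withLogUri("s3://my-bucket/logs/")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small")
                        .withHadoopVersion("1.0.3"));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}

runJobFlow() returns the job flow ID, which you can then use to monitor the job in the console.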

Hope this helps you get started.


Also see: http://aws.amazon.com/code/Elastic-MapReduce/28549 - which also uses Hive to access DynamoDB. This is apparently the official AWS-supported way to access DynamoDB from Hadoop.

If you need to write custom code in a custom JAR, I found: DynamoDB InputFormat for Hadoop.

However, I could not find any documentation of the Java parameters for this InputFormat that would correspond to the Hive parameters. According to this article, they have not been released by Amazon: http://www.newvem.com/amazon-dynamodb-part-iii-mapreducin-logs/

Also see: jar containing org.apache.hadoop.hive.dynamodb

Therefore, the officially documented way to consume DynamoDB data from a custom MapReduce job is to export the DynamoDB data to S3 and then let Elastic MapReduce pick it up from S3. I assume this is because DynamoDB was designed for random access as a "NoSQL" key/value store, while Hadoop's input and output formats are designed for sequential access with large block sizes. Amazon's undocumented code may contain some tricks to bridge that gap.

Since export/re-import uses resources, it would be better if the task could be completed from within Hive.

