Combining AWS EMR Output Files

I ran an AWS EMR test job using a custom mapper, but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I merge them into one file?

I don't need to aggregate the data in any particular way, and I don't care whether it is sorted, randomly reordered, or left in order. But I would like to get the data back into a single file efficiently. Should I do this manually, or is there a way to do it as part of the EMR cluster?

It seems very strange to me that there is no default option or some kind of automatic step for this. I have read a little about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching the cluster through the EMR console?

My data is in S3.


EDIT

To be extremely clear, I can run cat on all the output parts after the job completes, if that is what I have to do. Locally, on an EC2 instance, or somewhere else. Is this really what everyone does?

1 answer

If the mapper output part files themselves are small, you can try using hadoop fs -getmerge to combine them into a single file on the local file system:

hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]

Then return the merged file to S3:

hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/

To execute the above commands, you must have the following properties set in core-site.xml:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
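
The two commands above could be wrapped in a small shell function so the merge and the upload run as one step. This is only a sketch: the function name merge_s3_output and its three arguments are hypothetical, and it assumes the hadoop CLI is on the PATH with the S3 credentials already configured as shown above.

```shell
#!/bin/sh
# Sketch only: merge_s3_output is a hypothetical helper, not part of Hadoop or EMR.
# Assumes the hadoop CLI is on PATH and the fs.s3n.* credentials are configured.

merge_s3_output() {
    src_dir=$1      # output directory holding the part files, e.g. s3n://BUCKET/path/to/output/
    dest_dir=$2     # destination for the merged file, e.g. s3n://BUCKET/path/to/put/
    local_file=$3   # scratch file on the local disk

    # Concatenate the files under src_dir into one local file.
    hadoop fs -getmerge "$src_dir" "$local_file" || return 1

    # Upload the merged file back to S3.
    hadoop fs -put "$local_file" "$dest_dir" || return 1

    # Remove the local scratch copy once the upload has succeeded.
    rm -f "$local_file"
}
```

It would be called as merge_s3_output s3n://BUCKET/path/to/output/ s3n://BUCKET/path/to/put/ [LOCAL_FILE]. As with the plain commands, the merged data passes through the local disk, so this only makes sense when the combined output fits there.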