I ran an AWS EMR job using a custom mapper, but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I merge them into one file?
I don't need to aggregate the data in any special way, and I don't care whether it ends up sorted, shuffled, or left in its original order. I just want to get the data back into a single file efficiently. Do I have to do this manually, or is there a way to do it as part of the EMR job?
It seems very strange to me that there is no default option or automatic step for this. I have read a little about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching the cluster through the EMR console?
My data is in S3.
EDIT
To be extremely clear, I can run `cat` on all the output parts after the job finishes, if that is what I have to do, whether locally, on an EC2 instance, or somewhere else. Is this really what everyone does?
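For reference, the manual concatenation I have in mind looks roughly like the sketch below. It assumes the part files have already been copied from S3 into a local directory and that they follow the usual `part-*` naming; the function name `merge_parts` and the paths are just placeholders of mine.

```python
import glob
import os
import shutil


def merge_parts(parts_dir, merged_path, pattern="part-*"):
    """Concatenate every file in parts_dir matching pattern into merged_path.

    Files are appended in sorted name order; returns how many were merged.
    Assumes merged_path itself does not match the pattern.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, pattern)))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream the bytes across so large parts are not loaded into memory.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

I would call it as something like `merge_parts("output", "merged.txt")` after downloading the job output, which works but feels like a step the cluster should be able to do for me.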