How to get s3distcp to combine with newlines

Question

I have many millions of small, single-line S3 files that I want to merge. I have the s3distcp syntax worked out, but I found that the merged output contains no newlines between the original files.

Does s3distcp include an option to force a newline, or is there another way to accomplish this without modifying the source files directly (or copying them and doing the same)?

amazon-s3 hadoop amazon-emr hadoop-streaming

isueightynine Jul 13 '15 at 21:20

1 answer

maxymoo · Answer 1 · 2015-08-28T00:52:09+0000

If your text files begin and end with a unique sequence of characters, you can first combine them into one file with s3distcp (I did this by setting --targetSize to a very large number), then use sed with Hadoop streaming to add the newlines. In the following example, each file contains one JSON object (the file names all begin with 0), and the sed command inserts a newline between each instance of }{:

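# Scratch directory for the merged, newline-less output of s3distcp.
hadoop fs -mkdir hdfs:///tmpoutputfolder/
# Note: Hadoop streaming refuses to write to an output directory that already
# exists, so this mkdir may need to be skipped for the second job to start.
hadoop fs -mkdir hdfs:///finaloutputfolder/
# Step 1: merge every file whose name matches the --groupBy pattern; the very
# large --targetSize keeps the merged data from being split into many files.
hadoop jar lib/emr-s3distcp-1.0.jar \
               --src s3://inputfolder \
               --dest hdfs:///tmpoutputfolder \
               --targetSize 1000000000 \
               --groupBy ".*(0).*"
# Step 2: pass the merged data through sed in a single reducer, inserting a
# newline at every }{ boundary between adjacent JSON objects.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
               -D mapred.reduce.tasks=1 \
               --input hdfs:///tmpoutputfolder \
               --output hdfs:///finaloutputfolder \
               --mapper /bin/cat \
               --reducer '/bin/sed "s/}{/}\n{/g"'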
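
Before running the cluster job, it may help to check the sed expression locally on a small sample. The following is a minimal sketch, assuming GNU sed (where \n in the replacement means a newline); the sample records and the variable names merged and split are made up for illustration and are not part of the original answer.

```shell
# Three single-line JSON records concatenated with no separator,
# mimicking what the merged s3distcp output looks like (hypothetical data).
merged='{"id":1}{"id":2}{"id":3}'

# Same substitution as the reducer above: break the stream at every }{ boundary.
# Interpreting \n in the replacement as a newline is a GNU sed behaviour.
split=$(printf '%s' "$merged" | sed 's/}{/}\n{/g')

# Print the result, one JSON record per line.
printf '%s\n' "$split"
```

Note that the pattern matches any }{ pair, so it would also split inside a string value that happens to contain those two characters; this is why the approach relies on the files beginning and ending with a unique sequence.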
