I am trying to use S3DistCp to get around the small-files problem in Hadoop. It works, but the job's output layout is a little annoying. The paths to the files I'm dealing with look like this:
s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv
and there may be several files in each such folder. I want to group by the folder name, so I use the following group-by argument with s3distcp:
--groupBy '.*(........-.........-....-............).*'
This does group the files, but it still produces multiple output folders with one file in each. Is there a way to put the grouped files into a single folder instead of several?
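For context, this is a sketch of how the `--groupBy` pattern behaves: keys whose first capture group yields the same string are concatenated into one output file named after that captured value. This only illustrates the regex matching in Python; it is not S3DistCp's actual implementation, and the second key is a hypothetical sibling file added for the example.

```python
import re

# The --groupBy pattern from the question. Note that '.' also matches '-',
# which is why the unusual 8-9-4-12 grouping still lines up with a UUID.
pattern = re.compile(r'.*(........-.........-....-............).*')

# Hypothetical keys under the same UUID folder (the second one is invented
# for illustration).
keys = [
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv",
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201403.csv",
]

# Both keys capture the same group value, so S3DistCp would merge them
# into a single output file named after that value.
groups = {pattern.match(k).group(1) for k in keys}
print(groups)  # one distinct group -> one merged output file
```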
Thanks!