S3DistCp Grouping by Folder

I am trying to use S3DistCp to get around the small-files problem in Hadoop. It works, but the output layout is a little annoying for my job. The paths of the files I'm dealing with look like this:

s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv

and there may be several files in each such folder. I want to group by folder name, so I pass the following --groupBy argument to S3DistCp:

--groupBy '.*(........-.........-....-............).*'

and it groups the files, but I still end up with multiple output folders, each containing a single file. Is there a way to place the grouped files in one folder instead of several?
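For anyone checking what the pattern above actually captures, here is a minimal sketch. It assumes that keys sharing the same captured group value are the ones that get combined, and it uses Python's `re` module as a stand-in for S3DistCp's own regex matching, so edge cases may differ. The function name `group_keys` and the second and third keys are hypothetical, added only for illustration. It does not model where S3DistCp writes its output.

```python
import re
from collections import defaultdict

# Pattern copied verbatim from the --groupBy argument in the question.
GROUP_BY = r".*(........-.........-....-............).*"


def group_keys(keys, pattern=GROUP_BY):
    """Bucket keys by the text captured in the pattern's first group."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.fullmatch(key)
        if match:  # keys that do not match the pattern are skipped
            groups[match.group(1)].append(key)
    return dict(groups)


keys = [
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv",
    # The two keys below are hypothetical, for illustration only.
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201403.csv",
    "s3://test-bucket/test/11111111-2222-3333-4444-555555555555/201402.csv",
]

for folder, members in group_keys(keys).items():
    print(folder, len(members))
```

In this sketch the captured value is the whole folder name, so the pattern itself does identify the folder; the second run of dots is nine characters long and spans one of the hyphens in the folder name.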

Thanks!

+4
2 answers

2015-11-20 S3DistCp. . .

+2

You can try this: --groupBy ".*/(........-.........-....-............)/.*"

For your example, the source would be: --src "s3://test-bucket/test/"


+1
