I am trying to use S3DistCp to get around the small-files problem in Hadoop. It works, but the job's output layout is a little annoying. The paths to the files I'm dealing with look like this:
s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv
and there may be several files in each such folder. I want to group by the folder name, so I use the following group-by argument with s3distcp:
--groupBy '.*(........-.........-....-............).*'
This does group the files, but it still produces multiple output folders with one file in each. Is there a way to put the grouped files into a single folder instead of several?
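For context, this is a sketch of how the `--groupBy` pattern behaves: keys whose first capture group yields the same string are concatenated into one output file named after that captured value. This only illustrates the regex matching in Python; it is not S3DistCp's actual implementation, and the second key is a hypothetical sibling file added for the example.

```python
import re

# The --groupBy pattern from the question. Note that '.' also matches '-',
# which is why the unusual 8-9-4-12 grouping still lines up with a UUID.
pattern = re.compile(r'.*(........-.........-....-............).*')

# Hypothetical keys under the same UUID folder (the second one is invented
# for illustration).
keys = [
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201402.csv",
    "s3://test-bucket/test/0000eb6e-4460-4b99-b93a-469d20543bf3/201403.csv",
]

# Both keys capture the same group value, so S3DistCp would merge them
# into a single output file named after that value.
groups = {pattern.match(k).group(1) for k in keys}
print(groups)  # one distinct group -> one merged output file
```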
Thanks!