How can I download multiple objects from S3 simultaneously?

I have many (millions of) small log files in S3, named with a date/time to help identify them, i.e. servername-yyyy-mm-dd-HH-MM, e.g.

 s3://my_bucket/uk4039-2015-05-07-18-15.csv
 s3://my_bucket/uk4039-2015-05-07-18-16.csv
 s3://my_bucket/uk4039-2015-05-07-18-17.csv
 s3://my_bucket/uk4039-2015-05-07-18-18.csv
 ...
 s3://my_bucket/uk4339-2015-05-07-19-23.csv
 s3://my_bucket/uk4339-2015-05-07-19-24.csv
 ...
 etc

From EC2, using the AWS CLI, I would like to simultaneously download all files that have a minute of 16, for all of 2015, but only for the uk4339 and uk4338 servers.

Is there any reasonable way to do this?

Also, if this is a terrible file structure in S3 for querying data, I would be extremely grateful for any advice on how to lay it out better.

I can put the appropriate aws s3 cp ... commands in a bash shell loop to download the matching files sequentially, but I was wondering if there is something more efficient.
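For concreteness, here is a minimal sketch of that sequential approach, assuming the matching S3 URLs have already been written one per line to a file (the name keylist.txt is hypothetical):

 # Download each matching file in turn.
 # keylist.txt (hypothetical) holds one s3:// URL per line.
 while read -r url; do
     aws s3 cp "$url" .
 done < keylist.txt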

As an added bonus, I would also like to combine the results into a single CSV.
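In case it helps, here is a hedged sketch of that combining step, assuming every downloaded file shares the same header line (the uk43*.csv glob and the combined.csv name are illustrative):

 # Keep the header from the first file, then append data rows only.
 first=$(ls uk43*.csv | head -n 1)
 head -n 1 "$first" > combined.csv
 for f in uk43*.csv; do
     tail -n +2 "$f" >> combined.csv
 done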

A quick mock example CSV file can be generated in R using this line of code:

 R> write.csv(data.frame(cbind(a1=rnorm(100),b1=rnorm(100),c1=rnorm(100))),file='uk4339-2015-05-07-19-24.csv',row.names=FALSE) 

That creates the CSV file uk4339-2015-05-07-19-24.csv. FYI, I will import the combined data into R at the end.

1 answer

Since you did not answer my questions and have not indicated which OS you are using, it is somewhat difficult to make specific suggestions, so I will briefly suggest that you use GNU Parallel to parallelize your S3 fetch requests and get around the latency.

Suppose you have somehow generated a list of all the S3 files you want, and put the resulting list in a file called GrabMe.txt, like this:

 s3://my_bucket/uk4039-2015-05-07-18-15.csv
 s3://my_bucket/uk4039-2015-05-07-18-16.csv
 s3://my_bucket/uk4039-2015-05-07-18-17.csv
 s3://my_bucket/uk4039-2015-05-07-18-18.csv
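As an aside, one possible way to build that list (an assumption on my part, matching minute 16 in 2015 for the two servers in the question) would be:

 # List the bucket, keep only keys whose minute field is 16 in 2015
 # for servers uk4338 and uk4339, and prepend the s3:// prefix.
 aws s3 ls s3://my_bucket/ \
     | awk '{print $4}' \
     | grep -E '^uk433[89]-2015-..-..-..-16\.csv$' \
     | sed 's|^|s3://my_bucket/|' > GrabMe.txt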

Then you can fetch them in parallel, say 32 at a time, like this:

 parallel -j 32 echo aws s3 cp {} . < GrabMe.txt 

or, if you prefer to read from left to right:

 cat GrabMe.txt | parallel -j 32 echo aws s3 cp {} . 

Obviously, you can change the number of parallel requests from 32 to any other number. As written, it just echoes the command it would run, but you can remove the word echo once you see how it works.
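For example, once the echoed commands look right, dropping echo gives the real transfer:

 parallel -j 32 aws s3 cp {} . < GrabMe.txt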

There is a good tutorial here, and Ole Tange (the author of GNU Parallel) is on SO, so we are in good company.
