The fastest way to sync two Amazon S3 buckets

I have an S3 bucket containing about 4 million files, occupying about 500 GB in total. I need to synchronize the files to a new bucket (actually, renaming the bucket would be enough, but since that is not possible, I need to create a new bucket, move the files there, and delete the old one).

I use the AWS CLI s3 sync command, and it does the job, but it takes a lot of time. I would like to reduce the time so that the downtime of the dependent system is minimal.

I tried running the synchronization both from my local machine and from an EC2 c4.xlarge instance, and there is not much difference in time.

I noticed that the time can be slightly reduced when I split the task into several batches using --exclude and --include filters and run them in parallel from separate terminal windows, i.e.

    aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
    aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
    aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
    aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
    aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
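
For reference, here is a minimal sketch of how the same split jobs could be launched in parallel from a single shell session instead of separate terminal windows; the prefixes are the same illustrative ones as above, and the layout is just one possible arrangement:

    #!/usr/bin/env bash
    # Launch each prefix-filtered sync as a background job.
    for prefix in "1?" "2?" "3?" "4?"; do
        aws s3 sync s3://source-bucket s3://destination-bucket \
            --exclude "*" --include "${prefix}/*" &
    done
    # Catch everything the prefixes above do not cover.
    aws s3 sync s3://source-bucket s3://destination-bucket \
        --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*" &
    wait   # block until all background syncs have finished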

Is there anything else I can do to speed up the synchronization even more? Is another type of EC2 instance better suited for this job? Is splitting the task into several batches a good idea, and is there something like an "optimal" number of sync processes that can run in parallel on the same bucket?

Update

I am leaning toward the strategy of synchronizing the buckets first, then taking the system down, performing the migration, and synchronizing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a lot of time.

+12

5 answers

You can use EMR and S3-DistCp. I had to sync 153 TB between two buckets, and it took about 9 days. Also, make sure the buckets are in the same region, because otherwise you will also incur data transfer costs.

    aws emr add-steps --cluster-id <value> --steps Name="Command Runner",Jar="command-runner.jar",Args=["s3-dist-cp","--s3Endpoint","s3.amazonaws.com","--src","s3://BUCKETNAME","--dest","s3://BUCKETNAME"]
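
The add-steps call above assumes an EMR cluster is already running. If you do not have one, a rough sketch of launching a small, throwaway cluster for the copy could look like the following; the name, release label, instance type, and instance count are illustrative choices, not values from the original answer:

    # Launch a small EMR cluster; the command prints a ClusterId (j-XXXXXXXXXXXX)
    # that can then be passed as --cluster-id to the add-steps call above.
    aws emr create-cluster \
        --name "s3-bucket-copy" \
        --release-label emr-5.36.0 \
        --applications Name=Hadoop \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles

Remember to terminate the cluster once the copy step has finished, since it is billed per instance-hour.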

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html

+9

As a variation on what the OP is already doing:
you can create a list of all the files to synchronize with aws s3 sync --dryrun

    aws s3 sync s3://source-bucket s3://destination-bucket --dryrun
    # or even
    aws s3 ls s3://source-bucket --recursive

Using that list of objects to synchronize, split the task into several aws s3 cp ... commands (see the sketch below). This way the AWS CLI will not just sit there building the list of sync candidates, as it does when you run several sync jobs with arguments like --exclude "*" --include "1?/*".
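
A minimal sketch of one way to do that split, assuming the key list has been saved to a file and that keys contain no unusual whitespace; the file name and the level of parallelism are illustrative assumptions:

    # 1. Build a flat list of object keys from the source bucket.
    aws s3 ls s3://source-bucket --recursive \
        | awk '{ $1=""; $2=""; $3=""; sub(/^ +/, ""); print }' > all-keys.txt

    # 2. Fan the copies out over up to 16 concurrent aws s3 cp processes.
    xargs -P 16 -I {} \
        aws s3 cp "s3://source-bucket/{}" "s3://destination-bucket/{}" < all-keys.txt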

When all "copies" of tasks are completed, another synchronization may be worth it, for good measure, perhaps with --delete if the object can be deleted from the "source" bucket.

If the source and destination buckets are located in different regions, you could enable cross-region replication on the buckets before starting the synchronization.
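
For completeness, a rough sketch of what enabling replication from the CLI involves. Note that replication requires versioning on both buckets and an IAM role that S3 can assume, and that it only copies objects written after it is enabled, so it mainly helps with the final catch-up sync. The configuration file name below is a placeholder:

    # Versioning must be enabled on both buckets before replication can be configured.
    aws s3api put-bucket-versioning --bucket source-bucket \
        --versioning-configuration Status=Enabled
    aws s3api put-bucket-versioning --bucket destination-bucket \
        --versioning-configuration Status=Enabled

    # Apply a replication configuration; replication.json names the destination
    # bucket and an IAM role with the necessary S3 replication permissions.
    aws s3api put-bucket-replication --bucket source-bucket \
        --replication-configuration file://replication.json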

+4

Background information: the bottlenecks of the sync command are listing objects and copying objects. Listing objects is normally a sequential operation, although if you specify a prefix you can list a subset of the objects; that is the only trick for parallelizing it. Copying objects can be done in parallel.

Unfortunately, aws s3 sync does not perform any parallelization, and it does not even support listing by prefix unless the prefix ends with / (i.e. it can only list by folder). That is why it is so slow.

s3s3mirror (and many similar tools) parallelizes copying. I don't think it (or any other tool) parallelizes the listing of objects, because that requires a priori knowledge of how the objects are named. However, it does support prefixes, and you can invoke it several times, once for each letter of the alphabet (or whatever partitioning suits your key names).

You can also do this yourself using the AWS API.
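
A minimal sketch of that idea using the low-level s3api commands, listing one prefix per worker and doing server-side copies; the single-character prefixes and the degree of parallelism are illustrative assumptions about the key layout:

    # One background worker per known key prefix; copy-object is a server-side copy,
    # so no data passes through the machine running the script.
    for prefix in 0 1 2 3 4 5 6 7 8 9; do
        (
            aws s3api list-objects-v2 --bucket source-bucket --prefix "$prefix" \
                --query 'Contents[].Key' --output text \
            | tr '\t' '\n' \
            | grep -v '^None$' \
            | while read -r key; do
                # Note: copy-object handles objects up to 5 GB; larger ones need a multipart copy.
                aws s3api copy-object --bucket destination-bucket \
                    --key "$key" --copy-source "source-bucket/$key" > /dev/null
            done
        ) &
    done
    wait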

Finally, aws s3 sync (and any other tool) should be a little faster if you run it on an instance in the same region as your S3 bucket.

+2

40,000 objects (160 GB) copied/synced in less than 90 seconds.

Follow these steps in the S3 console (a CLI alternative is sketched below):

    step 1 - select the source bucket
    step 2 - under the source bucket's properties, choose the advanced settings
    step 3 - enable Transfer Acceleration and note the accelerate endpoint
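
The same setting can also be flipped from the CLI instead of the console; this is a sketch using the standard s3api call, with the bucket name as a placeholder:

    # Enable S3 Transfer Acceleration on the source bucket from the command line.
    aws s3api put-bucket-accelerate-configuration \
        --bucket source-test-1992 \
        --accelerate-configuration Status=Enabled

    # The accelerate endpoint then has the form <bucket-name>.s3-accelerate.amazonaws.com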

Configure the AWS CLI once (no need to repeat this before every sync):

    aws configure set default.region us-east-1   # set it to your default region
    aws configure set default.s3.max_concurrent_requests 2000
    aws configure set default.s3.use_accelerate_endpoint true


Options:

--delete: deletes a file at the destination if it is not present in the source

The AWS CLI command to sync:

    aws s3 sync s3://source-test-1992/foldertobesynced/ s3://destination-test-1992/foldertobesynced/ --delete --endpoint-url http://source-test-1992.s3-accelerate.amazonaws.com

Transfer Acceleration pricing:

https://aws.amazon.com/s3/pricing/#S3_Transfer_Acceleration_pricing

The pricing page does not mention the cost when the buckets are in the same region.

0
