I have an S3 bucket with about 4 million files, occupying about 500 GB. I need to synchronize the files to a new bucket (renaming the existing bucket would actually be enough, but since that is not possible, I need to create a new bucket, move the files there, and delete the old one).
I use the AWS CLI s3 sync command, and it does the job, but it takes a lot of time. I would like to reduce the time so that the downtime of the dependent system is minimal.
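For reference, the baseline is a plain bucket-to-bucket sync (the bucket names here are the same placeholders used below):

aws s3 sync s3://source-bucket s3://destination-bucket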
I tried running the synchronization from both my local machine and a c4.xlarge EC2 instance, and there is not much difference in time.
I noticed that the time can be slightly reduced when I split the task into several batches using --exclude and --include filters and run them in parallel from separate terminal windows, i.e.
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*"
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
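For what it's worth, the same split can be driven from a single shell script instead of separate terminal windows; a minimal sketch, using the same placeholder bucket names and prefix filters as above:

#!/bin/bash
# Launch the four prefix batches plus the catch-all as background jobs
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*" &
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*" &
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*" &
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*" &
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*" &
wait  # block until all background sync jobs have finished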
Is there anything else I can do to speed up the synchronization even more? Is another type of EC2 instance more suitable for this job? Is splitting the task into several batches a good idea, and is there something like an "optimal" number of sync processes that can run in parallel on the same bucket?
Update
I am leaning toward the strategy of synchronizing the buckets before taking the system down, performing the migration, and then synchronizing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command takes a lot of time even on buckets with no differences.
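In outline, the plan would look like this (a sketch with the same placeholder bucket names; my understanding is that even a no-op sync still has to list and compare all ~4 million keys in both buckets, which seems to be where the time goes):

# Phase 1: bulk copy while the system is still live (slow, but no downtime)
aws s3 sync s3://source-bucket s3://destination-bucket

# ... take the dependent system down and point it at the new bucket ...

# Phase 2: copy only the delta accumulated during phase 1 (few files,
# but sync still enumerates the full key listings of both buckets)
aws s3 sync s3://source-bucket s3://destination-bucket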