Download large archive from AWS Glacier using Boto

I am trying to download a large archive (~1 TB) from Glacier using the Boto Python package. The current method I'm using looks like this:

    import os
    import time
    import boto
    import boto.glacier

    ACCESS_KEY_ID = 'XXXXX'
    SECRET_ACCESS_KEY = 'XXXXX'
    VAULT_NAME = 'XXXXX'
    ARCHIVE_ID = 'XXXXX'
    OUTPUT = 'XXXXX'

    layer2 = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                                  aws_secret_access_key=SECRET_ACCESS_KEY)
    gv = layer2.get_vault(VAULT_NAME)
    job = gv.retrieve_archive(ARCHIVE_ID)
    job_id = job.id

    while not job.completed:
        time.sleep(10)
        job = gv.get_job(job_id)

    if job.completed:
        print "Downloading archive"
        job.download_to_file(OUTPUT)

The problem is that the job output expires 24 hours after the retrieval job completes, which is not enough time to download the entire archive. I will need to split the download into at least four parts. How can I do this and write the output to a single file?

1 answer

It seems you can simply specify the chunk_size parameter when calling job.download_to_file as follows:

    if job.completed:
        print "Downloading archive"
        job.download_to_file(OUTPUT, chunk_size=1024*1024)

However, if you can't download all the chunks within 24 hours, I don't think layer2 lets you download only the ones you missed.

First method

Using layer1, you can simply use the get_job_output method and specify the byte range you want to download.

It will look like this:

    import os
    import boto.glacier.layer1

    # get_job_output with a byte range lives on the layer1 connection
    layer1 = boto.glacier.layer1.Layer1(aws_access_key_id=ACCESS_KEY_ID,
                                        aws_secret_access_key=SECRET_ACCESS_KEY)

    chunk = 1024 * 1024
    # Resume from however much has already been written
    file_size = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0

    if job.completed:
        print "Downloading archive"
        # Append so a restarted script continues where it left off
        with open(OUTPUT, 'ab') as output_file:
            i = 0
            while True:
                # Glacier byte ranges are inclusive on both ends
                start = file_size + chunk * i
                response = layer1.get_job_output(VAULT_NAME, job_id,
                                                 byte_range=(start, start + chunk - 1))
                data = response.read()
                output_file.write(data)
                if len(data) < chunk:
                    break
                i += 1

With this approach, you can restart the script if it fails and resume the download where you left off.
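As a side note on the byte-range arithmetic: Glacier ranges are inclusive on both ends, so a 1 MiB chunk starting at `start` ends at `start + 1048576 - 1`. A minimal helper (hypothetical name, not part of boto) that computes the ranges left to fetch could look like this:

```python
def byte_ranges(total_size, chunk_size, resume_at=0):
    """Return inclusive (start, end) byte ranges covering
    bytes resume_at..total_size-1 in chunk_size pieces."""
    ranges = []
    start = resume_at
    while start < total_size:
        # Inclusive end: the last byte of this chunk, capped at the file end
        end = min(start + chunk_size, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For example, `byte_ranges(10, 4)` returns `[(0, 3), (4, 7), (8, 9)]`, and resuming at offset 8 with `byte_ranges(10, 4, resume_at=8)` returns just `[(8, 9)]`.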

Second method

Digging into the boto code, I found a "private" method on the Job class that you can also use: _download_byte_range. With this method you can stay on layer2.

    import os

    chunk = 1024 * 1024
    file_size = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0

    if job.completed:
        print "Downloading archive"
        with open(OUTPUT, 'ab') as output_file:
            i = 0
            while True:
                start = file_size + chunk * i
                # Caveat: being private, this method's signature may change
                # between boto versions
                data = job._download_byte_range(start, start + chunk - 1)
                output_file.write(data)
                if len(data) < chunk:
                    break
                i += 1
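One caveat with any manual ranged download: as far as I can tell, download_to_file verifies the SHA-256 tree hash for you, but these chunked approaches skip that check. If you want to verify the reassembled file yourself, the tree-hash algorithm AWS documents (SHA-256 over 1 MiB chunks, digests combined pairwise) can be sketched with only the standard library; glacier_tree_hash is a hypothetical helper name:

```python
import binascii
import hashlib

MiB = 1024 * 1024

def glacier_tree_hash(data):
    """SHA-256 tree hash of a byte string, per the Glacier checksum spec."""
    # Leaf level: SHA-256 of every 1 MiB chunk (empty input hashes as one chunk)
    chunks = [data[i:i + MiB] for i in range(0, len(data), MiB)] or [b'']
    level = [hashlib.sha256(c).digest() for c in chunks]
    # Combine adjacent pairs until a single root digest remains;
    # an odd leftover digest is promoted to the next level unchanged
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [hashlib.sha256(b''.join(p)).digest() if len(p) == 2 else p[0]
                 for p in pairs]
    return binascii.hexlify(level[0]).decode('ascii')
```

For a real 1 TB archive you would stream the file rather than hold it in memory, then compare the result against the archive tree hash reported in the job description.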

Source: https://habr.com/ru/post/1211252/

