Upload and compress a file on S3

I recently started working with S3 and ran into the need to compress and upload large files (10 GB and more) to S3. The implementation I am currently working with creates a temporary compressed file locally, uploads it to S3, and then permanently deletes the temporary file. The problem is that for a 10 GB file I end up using almost 20 GB of local space until the upload completes. I need a way to transfer the file to S3 and compress it there. Is this approach viable? If so, how should I go about it? If not, is there a way to minimize the local space required? I have seen someone suggest that the file could be uploaded to S3, downloaded to an EC2 instance in the same region, compressed there, and then uploaded back to S3 while deleting the first copy on S3. This might work, but it seems to me that two transfers to end up with one file would not be cost-effective.

I have tried uploading a compressed stream, without success; all I have found is that S3 does not do the compression itself, and now I do not know how to approach this.

I am using the GZip library in .NET.

+7
upload amazon-s3 amazon-web-services gzip compression
5 answers

In the Linux shell, via the aws-cli, this was added about 3 months after you asked the question :-)

Added the ability to stream data using cp

So the best you can do, I think, is to pipe the output of gzip into the aws cli:

Upload from stdin:

gzip -c big_file | aws s3 cp - s3://bucket/folder/big_file.gz

Download to stdout:

aws s3 cp s3://bucket/folder/big_file.gz - | gunzip -c ...

+5

If space is at a premium in the place where you originally upload the file, then uploading the file to S3, followed by downloading, compressing, and re-uploading it to S3 on an EC2 instance in the same region as the S3 bucket, is actually a very sensible (if seemingly counter-intuitive) suggestion, for one simple reason:

AWS does not charge you for bandwidth between EC2 and S3 within the same region.

This is a perfect job for a spot instance ... and a good use case for SQS, to tell the spot machine what work it should pick up.
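
A rough sketch of what that recompression job could look like on the EC2 instance, using the aws cli's streaming cp mentioned in the answer above (the bucket and key names are placeholders, not the answerer's actual tooling):

# On an EC2 instance in the same region as the bucket, so the S3 traffic costs nothing:
# stream the uncompressed object down, gzip it, and stream the result straight back up.
aws s3 cp s3://the-bucket/path/big_file - | gzip -9c | aws s3 cp - s3://the-bucket/path/big_file.gz

# Once the compressed copy checks out, delete the original.
aws s3 rm s3://the-bucket/path/big_file

Because everything flows through pipes, the instance needs almost no local disk, which is what makes a small spot instance enough for the job.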

On the other hand ... you are wasting that much more of your local bandwidth uploading the file if you do not compress it first.

If you are a developer, you could build a utility similar to the one I have written for internal use (this is not a plug; it simply is not currently available for release), which compresses (via external tools) and uploads files to S3 on the fly.

It works something like this command-line pseudocode example:

 cat input_file | gzip -9c | stream-to-s3 --bucket 'the-bucket' --key 'the/path' 

That is a simplified use case, to illustrate the concept. Of course, my stream-to-s3 utility accepts a number of other arguments, including x-amz-meta metadata and the AWS access key and secret, but you get the idea, perhaps.

Common compression utilities such as gzip, pigz, bzip2, pbzip2, xz, and pixz can read the source file from STDIN and write compressed data to STDOUT without writing the compressed version of the file to disk.
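
For instance, pigz can sit in the middle of a pipeline so the compressed bytes go straight to whatever consumes them and never touch the disk; a minimal sketch with placeholder paths, using the aws cli streaming support from the first answer rather than the author's own utility:

tar -cf - /some/directory | pigz -9c | aws s3 cp - s3://the-bucket/path/archive.tar.gz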

The utility I use reads the file data from the pipeline through STDIN and, using S3 Multipart Upload (even for small files that do not technically need it, since S3 Multipart Upload handily does not require you to know the size of the file in advance), it just keeps sending data to S3 until it reaches EOF on its input stream. It then completes the multipart upload and verifies that everything succeeded.
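
For illustration only, the multipart flow such a utility drives against the S3 API looks roughly like the following, shown here with the aws cli's s3api commands; the bucket, key, chunk file, and parts.json are placeholders, and the real utility keeps the chunks in memory rather than in files:

# 1. Start the multipart upload and remember the returned upload id.
UPLOAD_ID=$(aws s3api create-multipart-upload --bucket the-bucket --key the/path --query UploadId --output text)

# 2. Upload each buffered chunk of the input stream as the next numbered part, keeping the ETags.
aws s3api upload-part --bucket the-bucket --key the/path --upload-id "$UPLOAD_ID" --part-number 1 --body chunk-001

# 3. At EOF, complete the upload with the collected list of part numbers and ETags.
aws s3api complete-multipart-upload --bucket the-bucket --key the/path --upload-id "$UPLOAD_ID" --multipart-upload file://parts.json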

I use this utility to build and upload entire compressed archives without ever touching a single block of disk space. Again, it was not especially difficult to write, and it could be done in any number of languages. I did not even use an S3 SDK; I rolled my own from scratch, using a standard HTTP agent and the S3 API documentation.

+4

I need a way to transfer the file to S3 and compress it there. Is this approach viable?

This approach is not viable / not optimal. Compression takes a lot of CPU resources, and Amazon S3 is in the business of storing data, not of performing heavy processing of your files.

With S3 you also pay for the bandwidth of what you transfer, so you are spending extra money to send more data than you need to.

I have seen someone suggest that the file could be uploaded to S3, downloaded to an EC2 instance in the same region, compressed there, and then uploaded back to S3 while deleting the first copy on S3.

What you can do is upload directly to an EC2 instance, compress there, and then upload to S3 from it. But now you have moved the 20 GB problem from your local machine to the EC2 instance.
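
As a minimal sketch of that route (the hostname, user, and paths are placeholders, and it assumes the instance has the aws cli and permission to write to the bucket):

# Copy the raw file to the instance, compress it there, push the result to S3, then clean up.
scp big_file ec2-user@my-ec2-host:/data/
ssh ec2-user@my-ec2-host 'gzip -9 /data/big_file && aws s3 cp /data/big_file.gz s3://the-bucket/path/big_file.gz && rm /data/big_file.gz'

The instance still needs room for the original plus the compressed copy while gzip runs, which is exactly the 20 GB problem mentioned above.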

The best approach is to keep compressing locally, as you do now, and then upload.

+1

If you are using .NET you can stream the data, but you will still need more than 20 GB of local storage.

Also, to be the bearer of bad news: Amazon S3 is just storage. You would need to spin up another (AWS) service to run a program that does the compression, so your application would upload, that service would compress, and S3 would only store the result.

If your project is on the smaller side, you might want to consider an IaaS provider rather than a PaaS, so that the storage and the application can live on the same set of servers.

0

One very important factor for upload bandwidth to S3 is parallel, multipart uploading. Several tools support it, such as the aws cli, s3cmd, or CrossFTP. From the .NET API, the same thing can be achieved with the TransferUtility class.
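
As a hedged example with the aws cli (the concurrency and chunk-size numbers are arbitrary, and the bucket/key are placeholders):

# The aws cli already splits large uploads into parallel multipart requests;
# these settings just tune how many parts it sends at once and how big they are.
aws configure set default.s3.max_concurrent_requests 16
aws configure set default.s3.multipart_chunksize 64MB
aws s3 cp big_file.gz s3://the-bucket/path/big_file.gz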

If you really need compression, have a look at S3DistCp, a tool that can perform transfers using multiple machines in parallel and compress on the fly.
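
It runs as a step on an EMR cluster; a sketch of the kind of invocation meant here, with placeholder paths (check the S3DistCp documentation for the exact option names on your EMR version):

s3-dist-cp --src s3://the-bucket/raw/ --dest s3://the-bucket/compressed/ --outputCodec=gz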

0