Endless loop when streaming .gz file from S3 using boto

I am trying to stream a .gz file from S3 using boto and iterate over the lines of the compressed text file without unpacking it to disk first. Mysteriously, the loop never terminates; once the whole file has been read, the iteration restarts at the beginning of the file.

Let's say I create and load an input file as follows:

    > echo '{"key": "value"}' > foo.json
    > gzip -9 foo.json
    > aws s3 cp foo.json.gz s3://my-bucket/my-location/

and I run the following Python script:

    import boto
    import gzip

    connection = boto.connect_s3()
    bucket = connection.get_bucket('my-bucket')
    key = bucket.get_key('my-location/foo.json.gz')
    gz_file = gzip.GzipFile(fileobj=key, mode='rb')
    for line in gz_file:
        print(line)

Result:

    b'{"key": "value"}\n'
    b'{"key": "value"}\n'
    b'{"key": "value"}\n'
    ...forever...

Why is this happening? I think that there must be something very basic that I am missing.

2 answers

Ah, boto. The problem is that the read method re-downloads the key if you call it again after the key has been completely read once (compare the read and next methods to see the difference).
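To make that behaviour concrete without touching S3, here is a minimal sketch. `RestartingKey` is a hypothetical stand-in, not boto's class; it only assumes what is described above: once read() has returned an empty result, the next call starts again from the first byte. A gzip stream may consist of several concatenated members, so a reader that sees the gzip header bytes again after the end of a member can plausibly treat them as the start of another member, which would explain the repeated line.

```python
import gzip
import io


class RestartingKey:
    """Hypothetical stand-in for a boto key (an assumption, not boto code).

    After read() has signalled end-of-data once by returning b'',
    the next read() starts over from the first byte.
    """

    def __init__(self, payload):
        self._payload = payload
        self._stream = None

    def read(self, size=0):
        if self._stream is None:
            # (Re)open the stream, mimicking a fresh download.
            self._stream = io.BytesIO(self._payload)
        data = self._stream.read() if size == 0 else self._stream.read(size)
        if not data:
            # End of data: drop the stream so the next call restarts.
            self._stream = None
        return data


payload = gzip.compress(b'{"key": "value"}\n')
key = RestartingKey(payload)

first = key.read(len(payload))  # the whole compressed object
eof = key.read(1)               # b'': end of data is signalled exactly once
again = key.read(2)             # the stream has restarted

print(first == payload)  # True
print(eof)               # b''
print(again)             # b'\x1f\x8b', the gzip magic bytes again
```

The third read is the interesting one: instead of staying empty, it hands back the first two bytes of the object.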

This is not the cleanest way to do this, but it solves the problem:

    import boto
    import gzip

    class ReadOnce(object):
        def __init__(self, k):
            self.key = k
            self.has_read_once = False

        def read(self, size=0):
            if self.has_read_once:
                return b''
            data = self.key.read(size)
            if not data:
                self.has_read_once = True
            return data

    connection = boto.connect_s3()
    bucket = connection.get_bucket('my-bucket')
    key = ReadOnce(bucket.get_key('my-location/foo.json.gz'))
    gz_file = gzip.GzipFile(fileobj=key, mode='rb')
    for line in gz_file:
        print(line)
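The wrapper can be checked offline as well. The sketch below is an illustration under assumptions, not a test against boto itself: `FakeKey` is a hypothetical object that restarts after its first empty read, which is the behaviour this answer attributes to the boto key. `ReadOnce` is copied from the answer unchanged.

```python
import gzip
import io


class FakeKey:
    """Hypothetical boto-like key: restarts after one empty read (assumption)."""

    def __init__(self, payload):
        self._payload = payload
        self._stream = None

    def read(self, size=0):
        if self._stream is None:
            self._stream = io.BytesIO(self._payload)
        data = self._stream.read() if size == 0 else self._stream.read(size)
        if not data:
            self._stream = None  # the next read() would start over
        return data


class ReadOnce(object):
    """Wrapper from the answer: keeps returning b'' after the first empty read."""

    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data


payload = gzip.compress(b'{"key": "value"}\n')
wrapped = ReadOnce(FakeKey(payload))

with gzip.GzipFile(fileobj=wrapped, mode='rb') as gz_file:
    lines = list(gz_file)  # terminates, because reads stay empty after EOF

print(lines)  # [b'{"key": "value"}\n']
```

Because the wrapper never forwards another read once the key has reported end-of-data, the restart in the underlying object is never triggered.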

Thanks to zweiterlinde for the wonderful insight and the excellent answer.

I was looking for a way to read a compressed S3 object directly into a Pandas DataFrame. Using his ReadOnce wrapper, it can be expressed in two lines:

    with gzip.GzipFile(fileobj=ReadOnce(bucket.get_key('my/obj.tsv.gz')), mode='rb') as f:
        df = pd.read_csv(f, sep='\t')