Reading a file from a private S3 bucket into a pandas dataframe

I am trying to read a CSV file from a private S3 bucket into a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv') 

I can read a file from a public bucket, but reading a file from the private bucket results in an HTTP 403: Forbidden error.

I configured AWS credentials using aws configure.

I can download a file from the private bucket using boto3, which uses the AWS credentials. It seems that I need to configure pandas to use AWS credentials, but I don't know how to do this.
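For reference, a minimal sketch of the kind of boto3 download that already works with these credentials (the bucket name, key, and local filename are placeholders):

    import boto3

    # boto3 picks up the credentials written by `aws configure`
    # (typically ~/.aws/credentials and ~/.aws/config)
    s3 = boto3.client('s3')
    s3.download_file('mybucket', 'file.csv', 'file.csv')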

6 answers

Pandas uses boto (not boto3) inside read_csv. You might only need to install boto for it to work correctly.

There are some problems with boto on Python 3.4.4 / 3.5.1. If you are on these platforms, then until those are fixed, you can use boto3 as follows:

    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    df = pd.read_csv(obj['Body'])

The object returned under the 'Body' key has a .read method (which returns a stream of bytes), which is enough for pandas.
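If your pandas version does not accept the streaming body directly, one workaround (a sketch, not part of the original answer) is to read the bytes into an in-memory buffer first:

    import io

    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    # read the StreamingBody fully and hand pandas a seekable buffer
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))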


Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections (see the pandas documentation). This should not break any existing code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in previous versions of pandas.

    import os
    import pandas as pd
    from s3fs.core import S3FileSystem

    # AWS keys are stored in an ini file in the same path;
    # refer to the boto3 docs for config settings
    os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

    s3 = S3FileSystem(anon=False)
    key = 'path/to/your-csv.csv'
    bucket = 'your-bucket-name'
    df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))

Update for pandas 0.20.3 without using s3fs:

    import boto3
    import pandas as pd
    import sys

    if sys.version_info[0] < 3:
        from StringIO import StringIO  # Python 2.x
    else:
        from io import StringIO  # Python 3.x

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    body = obj['Body']
    csv_string = body.read().decode('utf-8')
    df = pd.read_csv(StringIO(csv_string))

Based on this answer, I found that smart_open is much easier to use:

    import pandas as pd
    from smart_open import smart_open

    initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))

Update for pandas 0.22 and higher:

If you have already installed s3fs (pip install s3fs), you can read the file directly from the s3 path without importing s3fs explicitly:

    data = pd.read_csv('s3://bucket....csv')

See the stable pandas documentation for details.


Note that if your bucket is private and hosted by an AWS-like (S3-compatible) provider, you will run into errors, because s3fs does not load the profile config file at ~/.aws/config the way awscli does.

One solution is to define the corresponding environment variables:

    export AWS_S3_ENDPOINT=""
    export AWS_DEFAULT_REGION=""
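Alternatively, a sketch of pointing s3fs at the provider's endpoint directly through client_kwargs (the endpoint URL, bucket, and key below are placeholders, not from the original answer):

    import pandas as pd
    from s3fs import S3FileSystem

    # client_kwargs is forwarded to the underlying botocore client
    fs = S3FileSystem(
        anon=False,
        client_kwargs={'endpoint_url': 'https://s3.example-provider.com'},
    )
    with fs.open('your-bucket-name/path/to/your-csv.csv', mode='rb') as f:
        df = pd.read_csv(f)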
