Reading a file from a private S3 bucket into a pandas dataframe

I am trying to read a CSV file from a private S3 bucket into a pandas dataframe:

df = pandas.read_csv('s3://mybucket/file.csv') 

I can read a file from a public bucket, but reading a file from the private bucket results in an HTTP 403: Forbidden error.

I configured AWS credentials using aws configure.

I can download a file from the private bucket using boto3, which uses the AWS credentials. It seems that I need to configure pandas to use AWS credentials, but I don't know how to do this.
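For reference, a minimal sketch of the kind of boto3 download that already works with these credentials (the bucket name, key, and local filename are placeholders):

    import boto3

    # boto3 picks up the credentials written by `aws configure`
    # (typically ~/.aws/credentials and ~/.aws/config)
    s3 = boto3.client('s3')
    s3.download_file('mybucket', 'file.csv', 'file.csv')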

6 answers

Pandas uses boto (not boto3) inside read_csv. You might only need to install boto for it to work correctly.

There are some problems with boto on Python 3.4.4 / 3.5.1. If you are on these platforms, then until those are fixed, you can use boto3 as follows:

    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    df = pd.read_csv(obj['Body'])

The object returned under the 'Body' key has a .read method (which returns a stream of bytes), which is enough for pandas.
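If your pandas version does not accept the streaming body directly, one workaround (a sketch, not part of the original answer) is to read the bytes into an in-memory buffer first:

    import io

    import boto3
    import pandas as pd

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    # read the StreamingBody fully and hand pandas a seekable buffer
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))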


Updated for Pandas 0.20.1

Pandas now uses s3fs to handle S3 connections (see the pandas documentation). This should not break any existing code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in previous versions of pandas.

    import os
    import pandas as pd
    from s3fs.core import S3FileSystem

    # AWS keys are stored in an ini file in the same path;
    # refer to the boto3 docs for config settings
    os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

    s3 = S3FileSystem(anon=False)
    key = 'path/to/your-csv.csv'
    bucket = 'your-bucket-name'
    df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))

Update for pandas 0.20.3 without using s3fs:

    import boto3
    import pandas as pd
    import sys

    if sys.version_info[0] < 3:
        from StringIO import StringIO  # Python 2.x
    else:
        from io import StringIO  # Python 3.x

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket='bucket', Key='key')
    body = obj['Body']
    csv_string = body.read().decode('utf-8')
    df = pd.read_csv(StringIO(csv_string))

Based on this answer, I found that smart_open is much easier to use:

    import pandas as pd
    from smart_open import smart_open

    initial_df = pd.read_csv(smart_open('s3://bucket/file.csv'))

Update for pandas 0.22 and higher:

If you have already installed s3fs (pip install s3fs), you can read the file directly from the s3 path without importing s3fs explicitly:

    data = pd.read_csv('s3://bucket....csv')

See the stable pandas documentation for details.


Note that if your bucket is private and hosted by an AWS-like (S3-compatible) provider, you will run into errors, because s3fs does not load the profile config file at ~/.aws/config the way awscli does.

One solution is to define the corresponding environment variables:

    export AWS_S3_ENDPOINT=""
    export AWS_DEFAULT_REGION=""
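Alternatively, a sketch of pointing s3fs at the provider's endpoint directly through client_kwargs (the endpoint URL, bucket, and key below are placeholders, not from the original answer):

    import pandas as pd
    from s3fs import S3FileSystem

    # client_kwargs is forwarded to the underlying botocore client
    fs = S3FileSystem(
        anon=False,
        client_kwargs={'endpoint_url': 'https://s3.example-provider.com'},
    )
    with fs.open('your-bucket-name/path/to/your-csv.csv', mode='rb') as f:
        df = pd.read_csv(f)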
