How do I write data created in Python (a pandas data frame) to Redshift?

I have a data frame in Python. Can I write this data to Redshift as a new table? I have successfully created a database connection to Redshift and can run simple SQL queries. Now I need to write a data frame to it.

+12

7 answers

You can use to_sql to push data to a Redshift database. I was able to do this using a connection to my database through a SQLAlchemy engine. Just remember to set index=False in the to_sql call. A table will be created if it does not exist, and you can specify whether you want it to replace the table, append to it, or fail if the table already exists.

    from sqlalchemy import create_engine
    import pandas as pd

    conn = create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')

    df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},
                       {'A': 'bar', 'B': 'blue', 'C': 20}])

    df.to_sql('your_table', conn, index=False, if_exists='replace')

Please note that you may need to pip install psycopg2 to connect to Redshift through SQLAlchemy.

To_sql documentation

+26
    import pandas_redshift as pr

    pr.connect_to_redshift(dbname = <dbname>,
                           host = <host>,
                           port = <port>,
                           user = <user>,
                           password = <password>)

    pr.connect_to_s3(aws_access_key_id = <aws_access_key_id>,
                     aws_secret_access_key = <aws_secret_access_key>,
                     bucket = <bucket>,
                     subdirectory = <subdirectory>)

    # Write the DataFrame to S3 and then to redshift
    pr.pandas_to_redshift(data_frame = data_frame,
                          redshift_table_name = 'gawronski.nba_shots_log')

Details: https://github.com/agawronski/pandas_redshift

+6

Assuming you have access to S3, this approach should work:

Step 1: Write the DataFrame to S3 as a CSV (I use the AWS SDK boto3 for this).
Step 2: You know the columns, data types, and key/index for your Redshift table from your DataFrame, so you can generate a CREATE TABLE script and run it on Redshift to create an empty table.
Step 3: Send a COPY command from your Python environment to Redshift to copy the data from S3 into the empty table created in step 2.

It works like a charm every time.

Step 4: Before whoever pays for your cloud storage starts screaming, delete the CSV from S3.

If you find yourself doing this several times, wrapping all four steps in a function keeps things tidy; a rough sketch follows.
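Here is a minimal sketch of those four steps, assuming boto3 credentials with S3 access and a psycopg2 connection to Redshift; the bucket, key, table name, column list, and IAM role ARN are all placeholders you would replace with your own:

    import io
    import boto3
    import psycopg2

    def dataframe_to_redshift(df, conn):
        # Step 1: write the DataFrame as a CSV to S3 (bucket/key are placeholders)
        bucket, key = 'my-bucket', 'staging/my_table.csv'
        buf = io.StringIO()
        df.to_csv(buf, index=False, header=False)
        boto3.client('s3').put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

        cur = conn.cursor()
        # Step 2: create the empty target table (columns hard-coded here for illustration)
        cur.execute("""
            CREATE TABLE IF NOT EXISTS my_table (
                a VARCHAR(16),
                b VARCHAR(16),
                c INTEGER
            );
        """)
        # Step 3: COPY from S3 into the table (the role ARN is a placeholder)
        cur.execute("""
            COPY my_table FROM 's3://{}/{}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            CSV;
        """.format(bucket, key))
        conn.commit()
        cur.close()

        # Step 4: clean up the staging file in S3
        boto3.client('s3').delete_object(Bucket=bucket, Key=key)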

+4

I tried pandas df.to_sql(), but it was very slow: it took me more than 10 minutes to insert 50 rows. See this open issue (open at the time of writing).

I then tried odo from the Blaze ecosystem (as recommended in the issue discussion), but ran into a ProgrammingError which I did not bother to investigate.

Finally, what worked:

    import psycopg2

    # Fill in the blanks for the conn object
    conn = psycopg2.connect(user = 'user',
                            password = 'password',
                            host = 'host',
                            dbname = 'db',
                            port = 666)
    cursor = conn.cursor()

    args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple, np_data)))
    cursor.execute("insert into table (a,b,...) VALUES " + args_str.decode("utf-8"))

    cursor.close()
    conn.commit()
    conn.close()

Yes, plain old psycopg2. This is for a numpy array, but converting from a DataFrame to an ndarray should not be too difficult; one way is sketched below. This gave me about 3,000 rows per minute.
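A quick sketch of that conversion, assuming a DataFrame df whose columns line up with the target table (df and its contents here are just placeholders):

    import pandas as pd

    df = pd.DataFrame([{'a': 'foo', 'b': 1}, {'a': 'bar', 'b': 2}])

    # DataFrame -> ndarray -> tuple of row tuples, in the shape the mogrify loop above expects
    np_data = df.to_numpy()              # df.values on older pandas versions
    rows = tuple(map(tuple, np_data))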

However, the fastest solution, as per other teammates' recommendations, is to use the COPY command after dumping the data frame as a TSV/CSV into an S3 bucket and then copying it over to Redshift. This is worth investigating if you are copying really huge data sets. (I will update here if and when I try it.)

+4

For the purposes of this conversation, Postgres = Redshift. You have two options:

Option 1:

From Pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql

The pandas.io.sql module provides a set of query wrappers to facilitate data retrieval and reduce dependency on a database-specific API. Database abstraction is provided by SQLAlchemy, if installed. In addition, you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL.

Writing DataFrames

Assuming the following data is in a DataFrame, we can insert it into the database using to_sql ().

    id          Date  Col_1  Col_2  Col_3
    26    2012-10-18      X   25.7   True
    42    2012-10-19      Y  -12.4  False
    63    2012-10-20      Z   5.73   True

    In [437]: data.to_sql('data', engine)

In some databases, writing large DataFrames can lead to errors due to exceeding packet size limits. This can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data to the database in batches of 1000 rows at a time:

 In [438]: data.to_sql('data_chunked', engine, chunksize=1000) 

Option 2

Or you can just roll your own. If you have a data frame called data, simply iterate over it using iterrows:

 for row in data.iterrows(): 

then add each row to your database. I would use COPY rather than an INSERT per row, as it will be much faster (see the sketch after the link below).

http://initd.org/psycopg/docs/usage.html#using-copy-to-and-copy-from
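For what it's worth, here is a minimal sketch of that copy-based approach using psycopg2's copy_from (the API documented at the link above), assuming a psycopg2 connection conn and a table mytable whose columns match the data frame; both names are placeholders. Note that copy_from issues COPY ... FROM STDIN, which plain Postgres accepts but Redshift itself does not, so against an actual Redshift cluster you would fall back to the S3-based COPY shown in other answers:

    import io
    import pandas as pd
    import psycopg2

    def copy_dataframe(data, conn, table='mytable'):
        # Serialize the frame to an in-memory CSV (no header, no index)
        buf = io.StringIO()
        data.to_csv(buf, index=False, header=False)
        buf.seek(0)

        # Stream it into the table in one round trip instead of one INSERT per row
        cur = conn.cursor()
        cur.copy_from(buf, table, sep=',')
        conn.commit()
        cur.close()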

0

If I want to write to a sub-folder (schema) of my database, how do I do that?

Thanks.

0

I used to rely on the pandas to_sql() function, but it is just too slow. I have recently switched to doing the following:

    import pandas as pd
    import s3fs  # great module which allows you to read/write to s3 easily
    import sqlalchemy

    df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},
                       {'A': 'bar', 'B': 'blue', 'C': 20}])

    s3 = s3fs.S3FileSystem(anon=False)
    filename = 'my_s3_bucket_name/file.csv'
    with s3.open(filename, 'w') as f:
        df.to_csv(f, index=False, header=False)

    con = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')

    # make sure the schema for mytable exists
    # if you need to delete the table but not the schema, leave DELETE mytable
    # if you want to only append, I think just removing the DELETE mytable would work
    con.execute("""
        DELETE mytable;
        COPY mytable
        from 's3://%s'
        iam_role 'arn:aws:iam::xxxx:role/role_name'
        csv;""" % filename)

The IAM role must allow Redshift access to S3; see here for details.

I found that for a 300 KB file (a 12000x2 data frame) this takes 4 seconds, compared to the 8 minutes I was getting with the pandas to_sql() function.

0
