I need to join all of the tables in a PostgreSQL database and convert the result into a Python dictionary. The database contains 72 tables with more than 1,600 columns in total.
I wrote a simple Python script that joins several tables, but it fails with a memory error: all available memory is consumed at runtime, even though the script runs on a new virtual server with 128 GB of RAM and 8 processors. The failure happens while the lambda function inside the merge step is executing.
How can the following code be improved so that all of the joins complete?
```python
from functools import reduce  # needed for reduce() in Python 3

import pandas as pd
from sqlalchemy import create_engine

auth = 'user:pass'
engine = create_engine('postgresql://' + auth + '@host.com:5432/db')

sql_tables = ['table0', 'table1', 'table3', ..., 'table72']

# Read every table into its own DataFrame
df_arr = [pd.read_sql_query('select * from "' + table + '"', con=engine)
          for table in sql_tables]

# Outer-join all DataFrames on USER_ID
df_join = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['USER_ID']),
                 df_arr)

# Replace NaN with 'no_data' and convert the joined result to a dictionary
raw_dict = df_join.where(pd.notnull(df_join), 'no_data').to_dict()

print(df_join)
print(raw_dict)
print(len(df_arr))
```
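From what I understand, pandas' `chunksize` only bounds each individual read; the final outer merge still has to materialize the full joined result, so on its own it does not seem to help. A rough sketch of what I mean (the chunk size of 50,000 is an arbitrary value, and it reuses `engine` and `sql_tables` from above):

```python
# Chunked reading keeps each individual read small, but the outer merge at the end
# would still have to hold the fully joined result in memory.
row_counts = {}
for table in sql_tables:
    n_rows = 0
    for chunk in pd.read_sql_query('select * from "' + table + '"', con=engine,
                                   chunksize=50_000):
        n_rows += len(chunk)  # each chunk is a DataFrame of at most 50,000 rows
    row_counts[table] = n_rows
print(row_counts)
```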
Can I use Pandas for my purpose? Are there any better solutions?
The ultimate goal is to denormalize the database so that it can be indexed into Elasticsearch as documents, one document per user.
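For clarity, this is roughly the document shape I have in mind, built one table at a time so that only a single DataFrame is in memory at once. It is only a sketch: it assumes every table has a `USER_ID` column and at most one row per user per table, which may not hold for all 72 tables, and it reuses `engine` and `sql_tables` from above.

```python
from collections import defaultdict

documents = defaultdict(dict)  # USER_ID -> one flat document for Elasticsearch

for table in sql_tables:
    # Load one table at a time so only a single DataFrame is held in memory.
    df = pd.read_sql_query('select * from "' + table + '"', con=engine)
    df = df.where(pd.notnull(df), 'no_data')
    for row in df.to_dict(orient='records'):
        user_id = row.pop('USER_ID')
        # Prefix column names with the table name to avoid collisions between tables.
        documents[user_id].update({table + '.' + col: val for col, val in row.items()})

print(len(documents))  # number of per-user documents
```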