Join all PostgreSQL tables and create a Python dictionary

I need to join all PostgreSQL tables and convert the result to a Python dictionary. The database contains 72 tables, with more than 1600 columns in total.

I wrote a simple Python script that joins several tables, but it cannot join all of them: it fails with a memory error, with all available memory consumed at runtime. And I run the script on a new virtual server with 128 GB of RAM and 8 CPUs. The failure happens while the lambda function (the reduce step) is executing.

How can the following code be improved to perform all the joins?

    from functools import reduce  # required on Python 3, where reduce is no longer a builtin

    from sqlalchemy import create_engine
    import pandas as pd

    auth = 'user:pass'
    engine = create_engine('postgresql://' + auth + '@host.com:5432/db')

    sql_tables = ['table0', 'table1', 'table3', ..., 'table72']

    # Load every table into memory as its own DataFrame.
    df_arr = [pd.read_sql_query('select * from "' + table + '"', con=engine)
              for table in sql_tables]

    # Outer-join all of them on USER_ID, one merge at a time.
    df_join = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['USER_ID']),
                     df_arr)

    raw_dict = df_join.where(pd.notnull(df_join), 'no_data').to_dict()

    print(df_join)
    print(raw_dict)
    print(len(df_arr))

Can I use Pandas for my purpose? Are there any better solutions?

The ultimate goal is to denormalize the DB data to be able to index it in Elasticsearch as documents, one document per user.

+5
2 answers

Why don't you create a Postgres function instead of a script?

Here are some tips to help you avoid a memory error:

  • You can use the WITH clause (common table expressions), which can make better use of memory.
  • You can create some physical tables that each store the information of one group of tables in your database. These pre-built tables keep any single join small, so you avoid holding everything in memory at once; afterwards, all you have to do is join these physical tables, and you can wrap that in a function (see the sketch after this list).
  • You can build a data warehouse by denormalizing the necessary tables.
  • Last but not least, make sure the join columns are indexed accordingly.
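
For the second tip, here is a minimal sketch of what one such intermediate physical table could look like, reusing the connection details and table names from the question. The group0 and idx_group0_user_id names are made up, and it assumes column names do not collide across the grouped tables:

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://user:pass@host.com:5432/db')

    with engine.begin() as conn:
        # Materialize one small group of source tables as a physical table,
        # so the final join only touches a handful of pre-joined tables
        # instead of all 72 at once.
        conn.execute(text('''
            CREATE TABLE IF NOT EXISTS group0 AS
            SELECT *
            FROM "table0"
            FULL OUTER JOIN "table1" USING ("USER_ID")
            FULL OUTER JOIN "table3" USING ("USER_ID")
        '''))
        # Index the join key so joining the group tables together stays fast.
        conn.execute(text('CREATE INDEX IF NOT EXISTS idx_group0_user_id ON group0 ("USER_ID")'))

Repeating this for each group of tables and then joining only the group tables keeps every individual join small.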
+1

I'm not sure if this helps, but you can try pd.concat

 raw_dict = pd.concat([d.set_index('USER_ID') for d in df_arr], axis=1) 

Or, to be a little more disciplined about which table each column came from:

 raw_dict = pd.concat([d.set_index('USER_ID') for d in df_arr], axis=1, keys=sql_tables) 
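
Note that pd.concat returns a DataFrame rather than a dictionary, so a final conversion step is still needed. A sketch of that last step, reusing the 'no_data' fill from the question (df_wide stands for whichever concat result you kept):

    df_wide = pd.concat([d.set_index('USER_ID') for d in df_arr], axis=1, keys=sql_tables)
    # With keys=sql_tables, each column label becomes a (table, column) tuple.
    # orient='index' produces one dictionary entry per USER_ID, which matches
    # the one-document-per-user goal for Elasticsearch.
    raw_dict = df_wide.where(pd.notnull(df_wide), 'no_data').to_dict(orient='index')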

If this does not help, let me know and I will remove it.

0
