Import a CSV file into a PostgreSQL database using Python-Django

Note: scroll down to the Background section for more context on why the project uses Python-Django and South.

What is the best way to import the following CSV data

  "john","doe","savings","personal"
  "john","doe","savings","business"
  "john","doe","checking","personal"
  "john","doe","checking","business"
  "jemma","donut","checking","personal"

into a PostgreSQL database with related tables Person, Account, and AccountType (a rough model sketch follows the list below), taking into account that:

  • Administrator users can change the database model and the CSV import mappings at run time through the user interface.
  • The stored CSV-to-database table and field mappings are used when regular users import CSV files.
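For concreteness, the three related tables might look roughly like this in Django; the field names below are assumptions for illustration and are not given in the question:

  # models.py -- a minimal sketch of the three related tables
  # (field names are assumed; they are not part of the original question)
  from django.db import models

  class Person(models.Model):
      first_name = models.CharField(max_length=100)
      last_name = models.CharField(max_length=100)

  class AccountType(models.Model):
      name = models.CharField(max_length=50)   # e.g. "personal", "business"

  class Account(models.Model):
      person = models.ForeignKey(Person, on_delete=models.CASCADE)
      account_type = models.ForeignKey(AccountType, on_delete=models.CASCADE)
      kind = models.CharField(max_length=50)   # e.g. "savings", "checking"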

So far, two approaches have been considered.

  • ETL API approach: provide an ETL API with the spreadsheet, the CSV-to-database table/field mappings, and the target database connection information; the API then loads the spreadsheet and populates the target database tables. Looking at pygrametl, I don't think what I am aiming for is possible. In fact, I'm not sure any ETL API does this.
  • Row-level insert approach: parse the CSV-to-database table/field mappings, parse the spreadsheet, and generate SQL inserts in dependency (join) order.

I implemented the second approach, but I am struggling with flaws in the algorithm and the complexity of the code. Is there a Python ETL API that does what I want, or an approach that does not involve reinventing the wheel?
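For reference, a stripped-down sketch of what the row-level insert approach amounts to; the table and column names are assumed, and the real code is driven by the stored mappings rather than hard-coded like this:

  # A bare-bones sketch of the row-level insert approach (not the actual code).
  # In practice the stored table/field mappings decide which INSERT each CSV
  # column feeds, and duplicates (e.g. the repeated "john","doe") must be
  # looked up instead of re-inserted -- which is where the complexity creeps in.
  import csv
  from django.db import connection, transaction

  def import_csv(path):
      with open(path, newline="") as f, transaction.atomic():
          with connection.cursor() as cur:
              for row in csv.reader(f):
                  cur.execute(
                      "INSERT INTO person (first_name, last_name)"
                      " VALUES (%s, %s) RETURNING id",
                      [row[0], row[1]],
                  )
                  person_id = cur.fetchone()[0]
                  cur.execute(
                      "INSERT INTO accounttype (name) VALUES (%s) RETURNING id",
                      [row[3]],
                  )
                  type_id = cur.fetchone()[0]
                  cur.execute(
                      "INSERT INTO account (person_id, account_type_id, kind)"
                      " VALUES (%s, %s, %s)",
                      [person_id, type_id, row[2]],
                  )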


Background

The company I work for wants to move hundreds of project spreadsheets hosted on SharePoint into databases. We are close to completing a web application that meets this need: it lets an administrator define/model a database for each project, load spreadsheets into it, and configure how the data is viewed. At this stage of completion, switching to a commercial tool is not an option. Think of the web application as an alternative to django-admin (although it is not one), with a DB-modelling user interface, CSV import/export, custom views, and modular code to handle project-specific customisation.

The CSV import interface implemented so far is cumbersome and buggy, so I'm looking for feedback and alternative approaches.

+6

4 answers

In the end, I took a few steps back and applied Occam's razor to this problem, solving it with updatable SQL views. This meant a few sacrifices:

  • Dropping the south.db-based run-time schema administration API, dynamic model loading, and dynamic ORM syncing.
  • Defining models.py and the initial South migration by hand.

This allows a simple approach to importing flat data sets (CSV/Excel) into a normalized database:

  • Define unmanaged models in models.py for each table.
  • Back these with updatable SQL views (INSERT/UPDATE ... DO INSTEAD rules) defined in the initial South migration, with the views matching the spreadsheet layout.
  • Iterate over the rows of the CSV/Excel sheet and execute INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); (a sketch follows this list).
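A rough sketch of what this can look like, reusing the Person/Account/AccountType schema from the question. The view, rule, and sequence names below are assumptions, not the original code, and de-duplication of repeated people/account types is left out for brevity:

  # Sketch only: an updatable view whose columns match the spreadsheet, plus a
  # DO INSTEAD rule that fans each inserted row out to the normalized tables.
  # In the original approach this DDL lives in the initial South migration
  # (e.g. db.execute(CREATE_VIEW_SQL) in forwards()).
  from django.db import connection

  CREATE_VIEW_SQL = """
  CREATE VIEW account_import AS
      SELECT p.first_name, p.last_name, a.kind, t.name AS account_type
      FROM account a
      JOIN person p ON p.id = a.person_id
      JOIN accounttype t ON t.id = a.account_type_id;

  CREATE RULE account_import_insert AS ON INSERT TO account_import DO INSTEAD (
      INSERT INTO person (first_name, last_name)
          VALUES (NEW.first_name, NEW.last_name);
      INSERT INTO accounttype (name) VALUES (NEW.account_type);
      INSERT INTO account (person_id, account_type_id, kind)
          VALUES (currval('person_id_seq'), currval('accounttype_id_seq'), NEW.kind);
  );
  """

  def import_rows(rows):
      """rows: an iterable of (first_name, last_name, kind, account_type) tuples."""
      with connection.cursor() as cur:
          for row in rows:
              cur.execute(
                  "INSERT INTO account_import (first_name, last_name, kind, account_type)"
                  " VALUES (%s, %s, %s, %s)",
                  list(row),
              )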
0

How about dividing the problem into two separate problems?

Create a Person class that represents a person in the database. It can use the Django ORM, extend it, or be written from scratch.

You now have two problems:

  • Creating a Person instance from a row of the CSV.
  • Saving the Person instance to the database.

Now, instead of CSV-to-Database, you have CSV-to-Person and Person-to-Database, which I think is conceptually cleaner. When administrators change the schema, that changes the Person-to-Database side. When administrators change the CSV format, that changes the CSV-to-Person side. Now you can deal with each separately.
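A minimal sketch of that split, with assumed field names (the answer itself does not prescribe an implementation):

  # A minimal sketch of the CSV-to-Person / Person-to-Database split.
  import csv
  from dataclasses import dataclass

  @dataclass
  class Person:
      first_name: str
      last_name: str

      @classmethod
      def from_csv_row(cls, row):
          # CSV-to-Person: only this method knows the CSV column layout.
          return cls(first_name=row[0], last_name=row[1])

      def save(self):
          # Person-to-Database: only this method knows the current schema.
          # It could call the Django ORM, raw SQL, or anything else.
          pass

  def import_file(path):
      with open(path, newline="") as f:
          for row in csv.reader(f):
              Person.from_csv_row(row).save()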

Does it help?

+2

I write import subsystems at work almost every month, and because I do such tasks so often I wrote django-data-importer. This importer works like a Django form and has readers for CSV, XLS, and XLSX files that give you lists of dicts.

With the data_importer readers alone you can read the file into lists of dicts, iterate over them with a for loop, and save the rows to the DB. With the full importer you can do the same, but with the bonus of validating each field of each row, logging errors and events, and saving everything at the end.

Please take a look at https://github.com/chronossc/django-data-importer. I am sure it will solve your problem and help you process any kind of CSV file :)

To solve your problem, I suggest using data_importer together with Celery tasks. You upload the file and kick off the import task from a simple interface. The Celery task hands the file to the importer, which validates the rows, saves them, and records any errors. With some effort you can even show task progress to the users who uploaded the sheet.
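A hedged sketch of the Celery side of this, using the standard-library csv module as a stand-in for the importer; the task name, row layout, and error handling below are assumptions rather than django-data-importer's actual API:

  # tasks.py -- a Celery task that parses an uploaded CSV and saves the rows,
  # logging per-row errors.  Swap the csv.reader loop for a django-data-importer
  # importer if you use that library; all names here are assumed.
  import csv
  import logging

  from celery import shared_task

  logger = logging.getLogger(__name__)

  @shared_task
  def import_spreadsheet(path):
      imported, failed = 0, 0
      with open(path, newline="") as f:
          for line_no, row in enumerate(csv.reader(f), start=1):
              try:
                  first_name, last_name, kind, account_type = row
                  # ... create or look up Person, AccountType and Account here ...
                  imported += 1
              except Exception:
                  failed += 1
                  logger.exception("Row %d of %s could not be imported", line_no, path)
      return {"imported": imported, "failed": failed}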

0

Here is another approach I found on GitHub. It basically detects the schema and allows overrides. Its purpose is simply to generate raw SQL that can be executed with psql or any other driver.

https://github.com/nmccready/csv2psql

  % python setup.py install
  % csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
  % psql -f enrolled.sql

There are also many other options for modifying tables (such as creating primary keys from several existing columns) and for merging/dumping.

0
