Django with a huge MySQL database

What would be the best way to import multi-million-record CSV files into Django?

Currently, using Python's csv module, it takes 2-4 days to process a file of 1 million records. The script checks whether each record already exists, along with several other validations.

Is it possible to complete this process in a few hours?

Can memcached be used in some way?

Update: There are also Django ManyToManyField fields that need to be processed. How would these be handled with a direct load?
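For the ManyToManyField part, one thing worth knowing is that the relation is stored in an ordinary join table (exposed as Model.field.through), so once both sides of the relation have been loaded, the link rows can be inserted directly like any other rows. A minimal sketch, assuming hypothetical Book/Author models and a Django version that has bulk_create:

```python
# Sketch only: Book, Author and the `authors` field are made-up names.
# Django keeps each ManyToManyField in its own join table, reachable as
# Book.authors.through, so the link rows can be created in bulk once the
# Book and Author rows themselves have been loaded.
from myapp.models import Author, Book   # hypothetical app

BookAuthor = Book.authors.through        # the auto-created join model

# (book_id, author_id) pairs collected while parsing the CSV
pairs = [(1, 10), (1, 11), (2, 10)]

BookAuthor.objects.bulk_create(
    BookAuthor(book_id=b, author_id=a) for b, a in pairs
)
```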

+4
5 answers

I'm not sure about your case, but we had a similar scenario with Django where importing ~30 million records took more than a day.

Since our client was completely dissatisfied (and we were in danger of losing the project), after several unsuccessful optimization attempts in Python we made a radical change of strategy and did the import exclusively with Java and JDBC (plus some MySQL tuning), and the import time dropped to ~45 minutes (with Java it was very easy to optimize thanks to the very good IDE and profiler support).

+3

I would suggest using the MySQL Python driver directly. You could also consider some multithreading options.
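A rough sketch of what that can look like, assuming the MySQLdb (MySQL-python) driver and made-up table/column names; executemany sends each batch in a single round trip, and INSERT IGNORE stands in for the "record already exists" check when a unique key is defined on the table:

```python
import csv

import MySQLdb  # the MySQL-python driver

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="appdb")
cur = conn.cursor()

SQL = "INSERT IGNORE INTO myapp_record (code, name, value) VALUES (%s, %s, %s)"

with open("records.csv") as f:
    batch = []
    for row in csv.reader(f):
        batch.append((row[0], row[1], row[2]))
        if len(batch) >= 1000:
            cur.executemany(SQL, batch)   # 1000 rows per round trip
            batch = []
    if batch:
        cur.executemany(SQL, batch)

conn.commit()
cur.close()
conn.close()
```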

+1

Depending on the data format (you said CSV) and the database, you are probably better off loading the data directly into the database (either straight into the Django-managed tables or into temporary tables). As an example, Oracle and SQL Server provide dedicated tools for loading large amounts of data. In the case of MySQL, there are plenty of tricks you can use. For example, you can write a Perl/Python script to read the CSV file and generate a SQL script with insert statements, and then feed that SQL script directly to MySQL.
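As a sketch of that last idea (table and column names are invented), a script like this reads CSV on stdin and emits multi-row INSERT statements that can be piped straight into the mysql client:

```python
# Usage (hypothetical names):
#   python csv_to_sql.py < records.csv > records.sql
#   mysql -u app -p appdb < records.sql
import csv
import sys

ROWS_PER_STATEMENT = 1000

def quote(value):
    # minimal escaping for a quick one-off import script
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def flush(rows):
    if rows:
        sys.stdout.write(
            "INSERT INTO myapp_record (code, name, value) VALUES\n%s;\n"
            % ",\n".join(rows)
        )

buf = []
for row in csv.reader(sys.stdin):
    buf.append("(%s)" % ", ".join(quote(col) for col in row))
    if len(buf) >= ROWS_PER_STATEMENT:
        flush(buf)
        buf = []
flush(buf)
```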

As already noted, always drop indexes and triggers before loading large amounts of data, and add them back afterwards; rebuilding indexes after every insert is a major processing hit.
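For example (index and table names are illustrative, and `cur` is a DB-API cursor as in the earlier sketch; on MyISAM tables ALTER TABLE ... DISABLE KEYS / ENABLE KEYS does much the same job):

```python
# Drop the secondary index, run the whole bulk load, then rebuild the index once.
cur.execute("ALTER TABLE myapp_record DROP INDEX idx_record_code")

# ... bulk insert all the rows here ...

cur.execute("ALTER TABLE myapp_record ADD INDEX idx_record_code (code)")
```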

If you use transactions, either turn them off or batch your inserts so that individual transactions don't get too large (the definition of "too large" varies, but if you are loading a million rows of data, breaking that into roughly one thousand transactions is probably about right).
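A sketch of that batching, again with invented names and reusing the connection/cursor from above, committing every thousand rows rather than per row or once at the very end:

```python
BATCH = 1000
inserted = 0
for row in parsed_rows:   # whatever iterable yields your parsed CSV tuples
    cur.execute(
        "INSERT INTO myapp_record (code, name, value) VALUES (%s, %s, %s)", row
    )
    inserted += 1
    if inserted % BATCH == 0:
        conn.commit()      # keep each transaction to ~1000 rows
conn.commit()              # flush the final partial batch
```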

And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screw-up is not having a current backup to restore from.

0

As already mentioned, you want to bypass the ORM and go directly to the database. Depending on which database you are using, you will probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very fast data loading, and for MySQL you can use the LOAD DATA INFILE command. I am sure there is something similar for Postgres.
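A sketch of the MySQL route, driven from Python via MySQLdb (file path and table/column names are made up; LOCAL means the CSV lives on the client machine, and the server must allow local_infile):

```python
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret",
                       db="appdb", local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/records.csv'
    INTO TABLE myapp_record
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (code, name, value)
""")  # IGNORE 1 LINES skips the CSV header row
conn.commit()
```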

Loading a few million records should not take anywhere near 2-4 days; I regularly load a database with a couple of million rows into MySQL, running on a very busy machine, in a matter of minutes using mysqldump.

0

As Craig said, you're better off populating the database first. That implies creating Django models that simply correspond to the CSV cells (you can then create better models and write scripts to move the data over), as in the sketch below.
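A minimal sketch of such a staging model (names invented); every column is kept as plain text so the raw file can be loaded without any validation getting in the way:

```python
from django.db import models

class RawCsvRow(models.Model):
    # one field per CSV column, everything as text at this stage
    code = models.CharField(max_length=64)
    name = models.CharField(max_length=255)
    value = models.CharField(max_length=255)

    class Meta:
        db_table = "import_raw_csv_row"   # easy target for Navicat or LOAD DATA
```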

Then, for feeding the database: Navicat is the tool of choice here; you can get a functional 30-day demo from their site. It lets you import CSV into MySQL and save the import profile as XML...
Then run your data-handling scripts from within Django, and when you are done, migrate your models with South to get what you want, or, as I said earlier, create another set of models in your project and use scripts to convert/copy the data.

0
