MySQL Insert large datasets from a file using Java

I need to insert about 1.8 million lines from a CSV file into a MySQL database. (only one table)

Currently, Java is used to parse the file and insert each line.

As you can imagine, this takes quite a few hours (roughly 10).

The reason I am not loading this directly from the file into the DB is that the data needs to be manipulated before it is added to the database.

This process needs to be runnable by an IT manager, so I have set it up as a batch file for them to run after they drop the new CSV file into the right location. In other words, it needs to work simply by dropping the file in a certain location and running the batch file. (Windows environment)

My question is: what would be the fastest way to insert this much data? Bulk inserts from a temporary parsed file, or one insert at a time? Or maybe some other idea?

Second question: how can I optimize my MySQL installation for very fast inserts? (At some point a large SELECT over all the data will also be required.)

Note: the table will eventually be dropped and the whole process run again at a later date.

Some clarification: I am currently using ... opencsv.CSVReader to parse the file and then inserting one row at a time. I parse some of the columns and ignore others.

Some more clarification: local DB, MyISAM table.

+6
java mysql
12 answers

Quick tips for fast inserts:

  • Use the LOAD DATA INFILE syntax and let MySQL parse and insert it, even if you have to transform the data and write it back out to a file after the manipulation.
  • Use the multi-row insert syntax (a JDBC sketch of this follows the list):

    insert into table_name (col1, col2) values (val1, val2), (val3, val4), ...

  • Remove all keys/indexes before inserting.

  • Do it on the fastest machine you have (I/O-wise mainly, but RAM and CPU also matter), for both the database server and the inserting client; remember that you pay the I/O price twice (once reading the file, once inserting).
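For the multi-row insert syntax from Java, a minimal sketch could look like the following; the table name, the columns, and the String[] row shape are placeholders invented for the example, not anything from the question:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    class MultiRowInsert {
        // Inserts one chunk of rows with a single multi-row INSERT statement.
        static void insertChunk(Connection con, List<String[]> rows) throws Exception {
            StringBuilder sql = new StringBuilder(
                    "INSERT INTO my_table (col1, col2) VALUES ");
            for (int i = 0; i < rows.size(); i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            PreparedStatement ps = con.prepareStatement(sql.toString());
            try {
                int p = 1;
                for (String[] row : rows) {
                    ps.setString(p++, row[0]);
                    ps.setString(p++, row[1]);
                }
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }

Feeding it chunks of a few hundred to a few thousand rows at a time keeps each statement a reasonable size.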
+14

I would pick a large number, say 10k rows, load that many rows from the CSV, massage the data, do a batch insert, and then repeat until I have worked through the whole CSV. Depending on the massaging and the amount of data, 1.8M rows should not take 10 hours, more like 1-2 hours depending on your hardware.

edit: whoops, I left out a pretty important part: your connection should have auto-commit set to false. The code I copied this from did that as part of the GetConnection() method.

    Connection con = GetConnection();
    con.setAutoCommit(false);
    try {
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO my_table (col1, col2) VALUES (?, ?)");
        try {
            for (Data d : massagedData) {
                ps.setString(1, d.whatever());
                ps.setString(2, d.whatever2());
                ps.addBatch();
            }
            ps.executeBatch();
            con.commit();   // needed since auto-commit is off
        } finally {
            ps.close();
        }
    } finally {
        con.close();
    }
+4

Are you completely sure you turned off automatic commits in the JDBC driver?

This is a typical performance killer for JDBC clients.

+2

You should really use LOAD DATA on the MySQL console itself for this rather than going through code...

 LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table; 

If you need to manipulate the data, I would still recommend manipulating it in memory, rewriting it to a flat file, and pushing it into the database using LOAD DATA; I think that should be more efficient.
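A sketch of that two-step approach from Java; the table, the file name, and the allowLoadLocalInfile connection property are assumptions made for the example, and LOCAL infile also has to be permitted on the server:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadDataExample {
        public static void main(String[] args) throws Exception {
            // 1) Massage the data in memory and rewrite it as a plain CSV.
            BufferedWriter out = new BufferedWriter(new FileWriter("massaged.csv"));
            // ... write one transformed line per row here ...
            out.close();

            // 2) Push the whole file into MySQL in one statement.
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/db2?allowLoadLocalInfile=true", "user", "pass");
            Statement st = con.createStatement();
            st.execute("LOAD DATA LOCAL INFILE 'massaged.csv' INTO TABLE db2.my_table "
                    + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'");
            st.close();
            con.close();
        }
    }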

+1

Another idea: are you using a PreparedStatement for your JDBC inserts?

+1

Depending on what you need to do with the data before inserting it, your best options in terms of speed are:

1) Parse the file in Java, do whatever you need to with the data, write the "massaged" data out to a new CSV file, and load that with LOAD DATA INFILE.

2) If your processing is conditional (for example, you need to check whether a record already exists and do different things depending on whether it is an insert or an update), then (1) may not be possible. In that case you are better off doing batch inserts/updates (see the sketch below). Experiment to find the batch size that works best for you (somewhere around 500-1000 is a good starting point). Depending on the storage engine you use for the table, you may also need to split this into multiple transactions; a single transaction spanning all 1.8M rows will not do wonders for performance.
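For option (2), a batched insert with periodic commits might look roughly like this; the batch size, table, and column names are placeholders:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    class BatchInsert {
        // con must have auto-commit disabled; rows holds the massaged CSV data.
        static void insert(Connection con, List<String[]> rows) throws Exception {
            final int BATCH_SIZE = 1000;            // experiment with this value
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO my_table (col1, col2) VALUES (?, ?)");
            try {
                int count = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch();
                        con.commit();               // keep each transaction small
                    }
                }
                ps.executeBatch();                  // flush the remainder
                con.commit();
            } finally {
                ps.close();
            }
        }
    }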
+1

The biggest performance issue is most likely not Java but MySQL, in particular any indexes, constraints, and foreign keys on the table you are inserting into. Before you begin inserting, make sure you disable them. Re-enabling them at the end will take a fair amount of time, but it is far more efficient than having the database evaluate them after every statement.

You may also run into MySQL performance problems because of the size of the transaction. Your transaction log will grow very large with that many inserts, so committing after every X inserts (say 10,000-100,000) will help insert speed as well.

At the JDBC level, make sure you are using the addBatch() and executeBatch() methods on your PreparedStatement rather than plain executeUpdate().
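Roughly, from JDBC that could look like the sketch below; the table name is a placeholder, and on a MyISAM table DISABLE KEYS only suspends non-unique indexes:

    import java.sql.Connection;
    import java.sql.Statement;

    class BulkLoadWrapper {
        static void loadWithKeysDisabled(Connection con) throws Exception {
            Statement st = con.createStatement();
            try {
                st.execute("SET foreign_key_checks = 0");
                st.execute("ALTER TABLE my_table DISABLE KEYS");   // skip per-row index maintenance

                // ... run the batched inserts here ...

                st.execute("ALTER TABLE my_table ENABLE KEYS");    // rebuild indexes once, at the end
                st.execute("SET foreign_key_checks = 1");
            } finally {
                st.close();
            }
        }
    }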

+1

You can improve bulk INSERT performance from MySQL/Java by using the batching capability in the Connector/J JDBC driver.

MySQL doesn't "properly" handle batches (see my article link below), but it can rewrite batched INSERTs to take advantage of the handy multi-row MySQL syntax, i.e. you can tell the driver to rewrite two INSERTs such as:

    INSERT INTO my_table (col1, col2) VALUES ('val1', 'val2');
    INSERT INTO my_table (col1, col2) VALUES ('val3', 'val4');

as one statement:

    INSERT INTO my_table (col1, col2) VALUES ('val1', 'val2'), ('val3', 'val4');

(Note that I'm not saying that you need to rewrite your SQL in this way, the driver does this when possible)

We did this for a bulk-insert investigation of our own: it made an order-of-magnitude difference. Use it together with explicit transactions, as mentioned by others, and you will see a big improvement overall.

The relevant driver property is:

 jdbc:mysql:///<dbname>?rewriteBatchedStatements=true 
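The same setting can also be supplied programmatically when opening the connection; this is just a sketch, with host and credentials as placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    class RewriteBatchExample {
        static Connection open() throws Exception {
            Properties props = new Properties();
            props.setProperty("user", "user");
            props.setProperty("password", "pass");
            // lets Connector/J collapse addBatch()ed INSERTs into multi-row statements
            props.setProperty("rewriteBatchedStatements", "true");
            return DriverManager.getConnection("jdbc:mysql://localhost/mydb", props);
        }
    }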

See: 10x performance increase for batch INSERTs with MySQL Connector/J is on the way

+1

Wouldn't it be faster if you used LOAD DATA INFILE instead of inserting each row?

0

I would do three threads ...

1) Reads the input file and pushes each row onto a transform queue
2) Pops from the transform queue, converts the data, and pushes it onto a db queue
3) Pops from the db queue and inserts the data

That way you can be reading data from disk while the db thread is waiting for its I/O to finish, and vice versa.
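A rough sketch of that pipeline using BlockingQueues; the queue sizes, the massage() placeholder, and the poison-pill shutdown are assumptions added for the example, not part of the original suggestion:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineSketch {
        private static final String EOF = "__EOF__";   // poison pill marking end of input

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> transformQueue = new ArrayBlockingQueue<>(10000);
            BlockingQueue<String> dbQueue = new ArrayBlockingQueue<>(10000);

            Thread reader = new Thread(() -> {
                try (BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
                    String line;
                    while ((line = in.readLine()) != null) transformQueue.put(line);
                    transformQueue.put(EOF);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            Thread transformer = new Thread(() -> {
                try {
                    String line;
                    while (!(line = transformQueue.take()).equals(EOF)) {
                        dbQueue.put(massage(line));       // whatever manipulation is needed
                    }
                    dbQueue.put(EOF);
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            Thread writer = new Thread(() -> {
                try {
                    String row;
                    while (!(row = dbQueue.take()).equals(EOF)) {
                        // addBatch() here, executeBatch()/commit() every few thousand rows
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });

            reader.start(); transformer.start(); writer.start();
            reader.join(); transformer.join(); writer.join();
        }

        private static String massage(String line) { return line; }  // stand-in for the real transform
    }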

0

If you haven't already done so, try the MyISAM table type; just be sure to read up on its shortcomings before you do. It is generally faster than the other table types.

If your table has indexes, it's usually faster to drop them and then add them back after import.

If your data is all strings but is better suited to a relational model, you will be better off inserting integers that reference other values rather than storing long strings.
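As a rough illustration of that idea, one could keep an in-memory cache of string-to-id mappings and populate a lookup table on the fly; the table and column names here are invented for the example:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    class LookupCache {
        private final Map<String, Integer> ids = new HashMap<>();
        private final PreparedStatement insertLookup;

        LookupCache(Connection con) throws Exception {
            insertLookup = con.prepareStatement(
                    "INSERT INTO lookup_values (val) VALUES (?)",
                    Statement.RETURN_GENERATED_KEYS);
        }

        // Returns the integer id for a string, inserting it into the lookup table the first time.
        int idFor(String value) throws Exception {
            Integer id = ids.get(value);
            if (id == null) {
                insertLookup.setString(1, value);
                insertLookup.executeUpdate();
                ResultSet keys = insertLookup.getGeneratedKeys();
                keys.next();
                id = keys.getInt(1);
                keys.close();
                ids.put(value, id);
            }
            return id;
        }
    }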

But overall, yes, adding data to the database takes time.

0
