SQLite .import from a file: does it do one insert per row, or group them within a transaction?

I am importing millions of lines from a tab-delimited file, and sqlite3 .import with .mode tabs is very slow. I have three indexes, so the slowness probably comes from the indexing. But first I would like to check whether .import does one commit per row or groups many / all of them into a single commit. I could not find documentation on how .import works. Does anybody know? If the indexes are the problem (I had this issue with MySQL), how can I disable them and reindex at the end of the .import?
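A sketch of what I am after (the index name comes from my schema below; as far as I know SQLite has no MySQL-style DISABLE KEYS, so I am assuming that dropping and recreating the index is the equivalent):

    -- sketch only: drop the index, bulk-load, then rebuild it
    DROP INDEX IF EXISTS pos_vf_b36_idx;
    .mode tabs
    .import variations_build_36_ens-54.tab ensembl_vf_b36
    CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);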

[Update 1]

Following the comment by @sixfeetsix.

My schema:

    CREATE TABLE ensembl_vf_b36 (
        variation_name varchar(20),
        chr varchar(4),
        start integer,
        end integer,
        strand varchar(5),
        allele_string varchar(3),
        map_weight varchar(2),
        flags varchar(50),
        validation_status varchar(100),
        consequence_type varchar(50)
    );
    CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);

Data:

    rs35701516  NT_113875  352   352   1  G/A  2  NULL       NULL          INTERGENIC
    rs12090193  NT_113875  566   566   1  G/A  2  NULL       NULL          INTERGENIC
    rs35448845  NT_113875  758   758   1  A/C  2  NULL       NULL          INTERGENIC
    rs17274850  NT_113875  1758  1758  1  G/A  2  genotyped  cluster,freq  INTERGENIC

There are 15_608_032 rows in this table.

And these are the timing statistics:

    $ time sqlite3 -separator ' ' test_import.db '.import variations_build_36_ens-54.tab ensembl_vf_b36'

    real    29m27.643s
    user    4m14.176s
    sys     0m15.204s

[Update 2]

@sixfeetsix has a good answer, and if you are reading this you may also be interested in:

Faster bulk inserts in sqlite3?

Sqlite3: disabling primary key index on insertion?

[Update 3] Solution: from 30 min down to 4 min

Even with all the optimizations (see the accepted answer) it still takes almost 30 minutes, but if the import is done without the indexes and they are added at the end, the total time is 4 minutes:

    -- importing without indexes:
    real    2m22.274s
    user    1m38.836s
    sys     0m4.850s

    -- adding the indexes
    $ time sqlite3 ensembl-test-b36.db < add_indexes-b36.sql

    real    2m18.344s
    user    1m26.264s
    sys     0m6.422s
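(For reference, add_indexes-b36.sql is not shown here; it is essentially just the CREATE INDEX statement from the schema above, roughly:)

    -- sketch: rebuild the index from the original schema once the bulk import is done
    CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);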
1 answer

I believe the slowness really does come from building the index as more and more entries are added. Depending on the RAM you have, you can tell sqlite to use enough memory so that all of this index-building activity is done in memory (i.e., without all the I/O that would otherwise happen with less memory).

For 15M entries, I would say you should set the cache size to 500000.

You can also tell sqlite to keep the transaction journal in memory.

Finally, you can set synchronous to OFF, so sqlite never waits for writes to be committed to disk.
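A side note on what that cache size means: cache_size is counted in database pages, not bytes, so the memory it corresponds to depends on your page size (with the common 1024-byte page size, 500000 pages is roughly 500 MB). Something like this shows and sets it:

    -- cache_size counts pages; check page_size to estimate the memory involved
    PRAGMA page_size;           -- often 1024 on older sqlite builds
    PRAGMA cache_size = 500000; -- keep up to 500000 pages of the database in memory
    PRAGMA cache_size;          -- confirm the setting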

Using this, I was able to divide the time needed to import 15M records by more than 5 (from 14 minutes down to 2.5), with records made of random GUIDs split into 5 columns, using the three middle columns as an index:

    b40c1c2f 912c 46c7 b7a0 3a7d8da724c1
    9c1cdf2e e2bc 4c60 b29d e0a390abfd26
    b9691a9b b0db 4f33 a066 43cb4f7cf873
    01a360aa 9e2e 4643 ba1f 2aae3fd013a6
    f1391f8b f32c 45f0 b137 b99e6c299528

So, to try this, I suggest you put all the instructions in some file, say import_test:

    pragma journal_mode=memory;
    pragma synchronous=0;
    pragma cache_size=500000;
    .mode tabs
    .import variations_build_36_ens-54.tab ensembl_vf_b36

Then try:

 time sqlite3 test_import.db < import_test 

EDIT

This is a reply to Pablo's (the OP's) comments that followed this answer (it got too long for a comment). My (educated) guesses are these:

  • Since .import is not SQL per se, it does not fiddle much with transactions; I am even inclined to think it is written to go faster than if you did all of this within one "normal" transaction (see the sketch after this list); and,
  • If you have enough memory to allocate, and you set things up as I suggest, the real (time) hog here is reading the flat file and then writing the final contents of the database, because what happens in between is fast; that is, fast enough that there is not much to gain from optimizing it when you compare such a potential gain with the (probably) incompressible time spent on disk I/O.
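(To be clear about what I mean by one "normal" transaction, this is just a sketch of the hand-written alternative to .import, not something I benchmarked here:)

    -- sketch: the hand-written equivalent, all inserts wrapped in one explicit transaction
    BEGIN TRANSACTION;
    INSERT INTO ensembl_vf_b36 VALUES ('rs35701516', 'NT_113875', 352, 352, '1', 'G/A', '2', 'NULL', 'NULL', 'INTERGENIC');
    -- ... one INSERT per line of the tab file ...
    COMMIT;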

If I am mistaken, I would be glad to hear why for my own good.

EDIT 2

I ran a benchmark comparing having the index in place during the .import versus adding it right after the .import finished. I used the same technique, generating 15M records made of split random UUIDs:

    import csv, uuid

    # write 15M rows of random UUIDs, split on '-' into 5 tab-separated columns
    w = csv.writer(open('bla.tab', 'wb'), dialect='excel-tab')
    for i in xrange(15000000):
        w.writerow(str(uuid.uuid4()).split('-'))

Then I tested importing with the index created before versus after the import (the script below creates the index before):

    pragma journal_mode=memory;
    pragma synchronous=0;
    pragma cache_size=500000;
    create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
    CREATE INDEX test_idx on test (f2, f3, f4);
    .mode tabs
    .import bla.tab test
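The index-after variant is the same script with the CREATE INDEX moved below the .import, roughly:

    -- sketch of the index-after variant: same pragmas, index built once the import is done
    pragma journal_mode=memory;
    pragma synchronous=0;
    pragma cache_size=500000;
    create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
    .mode tabs
    .import bla.tab test
    CREATE INDEX test_idx on test (f2, f3, f4);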

So here is the timing with the index created before:

    [someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
    memory

    real    2m58.839s
    user    2m21.411s
    sys     0m6.086s

And with the index added after:

    [someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
    memory

    real    2m19.261s
    user    2m12.531s
    sys     0m4.403s

Do you see how the difference in "user" time (~9 s) does not account for the difference in total time (~40 s)? To me this means there is some extra I/O going on when the index is created first, so I was wrong to think that everything was done in memory with no extra I/O.

Conclusion: create the index after the import and you will get even better import times (as Donal mentioned).
