I believe the slowness comes from building the index as more and more entries are inserted. Depending on how much RAM you have, you can tell sqlite to use enough memory so that all this index-building activity happens in memory (i.e., without all the I/O that would otherwise occur with less memory).
For 15M entries, I would set the cache size to 500000 (note that cache_size is measured in database pages, not bytes).
You can also tell sqlite to keep the transaction journal in memory.
Finally, you can set synchronous to OFF, so sqlite never waits for records to be committed to disk.
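If you drive the import from Python rather than the sqlite3 shell (the shell recipe follows below), the same three settings can be applied through the standard sqlite3 module before inserting anything. A minimal sketch, assuming a hypothetical database file test_import.db:

import sqlite3

conn = sqlite3.connect('test_import.db')  # hypothetical file name
# Keep the transaction journal in RAM instead of on disk.
conn.execute('PRAGMA journal_mode = MEMORY')
# Never wait for writes to be physically committed to disk.
conn.execute('PRAGMA synchronous = OFF')
# Give the page cache enough room to build the index in memory
# (cache_size counts pages, not bytes).
conn.execute('PRAGMA cache_size = 500000')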
Using this, I was able to divide the time needed to import 15M records by 5 (from 14 minutes down to 2.5), with records made of random GUIDs split across 5 columns, using the three middle columns as an index:
b40c1c2f 912c 46c7 b7a0 3a7d8da724c1
9c1cdf2e e2bc 4c60 b29d e0a390abfd26
b9691a9b b0db 4f33 a066 43cb4f7cf873
01a360aa 9e2e 4643 ba1f 2aae3fd013a6
f1391f8b f32c 45f0 b137 b99e6c299528
So, to try this out, I suggest you put all the instructions in a file, say import_test:
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
.mode tabs
.import variations_build_36_ens-54.tab ensembl_vf_b36
Then try:
time sqlite3 test_import.db < import_test
EDIT
This is a reply to the comments from Pablo (the OP) following this answer (it has grown too long to fit as a comment). My (educated) guesses are these:
- Since .import is not SQL per se, it doesn't fiddle much with transactions; I'm even inclined to think it was written to run faster than if you had done all of the inserts in one "normal" transaction (see the sketch after this list); and,
- If you have enough memory to allocate, and you set things up as I suggest, then the real (time) hog here is reading the flat file and then writing the final content of the database, since what happens in between happens entirely in memory; that is, fast enough that there is not much point optimizing it, given the potential gain compared with the (probably) incompressible time spent on disk I/O.
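For comparison, here is a minimal sketch of the "one normal transaction" approach from the first point above, using Python's standard csv and sqlite3 modules; the file name, table name, and five-column schema are hypothetical:

import csv, sqlite3

conn = sqlite3.connect('test_import.db')  # hypothetical database file
conn.execute('create table test (f1 text, f2 text, f3 text, f4 text, f5 text)')
with open('data.tab') as f:  # hypothetical tab-separated input file
    # executemany pushes every row through one implicit transaction,
    # committed only once at the end.
    conn.executemany('insert into test values (?, ?, ?, ?, ?)',
                     csv.reader(f, dialect='excel-tab'))
conn.commit()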
If I'm mistaken on this, I'd be glad to hear why, for my own good.
EDIT 2
I benchmarked having the index in place during the .import against adding it immediately after the .import finished. I used the same technique as before, generating 15M records made of split random UUIDs:
import csv, uuid

# Python 2: write 15M rows of tab-separated UUID fragments to bla.tab
w = csv.writer(open('bla.tab', 'wb'), dialect='excel-tab')
for i in xrange(15000000):
    w.writerow(str(uuid.uuid4()).split('-'))
Then I tested the import with the index created before vs. after the import (here the index is created before):
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
CREATE INDEX test_idx on test (f2, f3, f4);
.mode tabs
.import bla.tab test
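The "index after" variant simply moves the CREATE INDEX below the .import; a sketch of that script, assuming the same schema:

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
.mode tabs
.import bla.tab test
CREATE INDEX test_idx on test (f2, f3, f4);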
So here is the time taken with the index created before:

[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
memory

real    2m58.839s
user    2m21.411s
sys     0m6.086s
And when the index is added after:

[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
memory

real    2m19.261s
user    2m12.531s
sys     0m4.403s
Do you see how the difference in "user" time (~9 s) does not account for the difference in total time (~40 s)? To me, this means that some extra I/O happens when the index is created first, and therefore I was mistaken in thinking everything was done in memory with no additional I/O.
Conclusion: create the index after the import and you'll get an even better import time (as Donal mentioned).