Python sqlite3 and concurrency

I have a Python program that uses the threading module. Once per second, my program launches a new thread that retrieves some data from the Internet and stores this data on my hard drive. I would like to use sqlite3 to store these results, but I cannot get it to work. The problem is the following line:

conn = sqlite3.connect("mydatabase.db") 
  • If I put this line of code in each thread, I get an OperationalError telling me that the database file is locked. I assume this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
  • If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError saying that SQLite objects created in a thread can only be used in that same thread.

I used to save all my results in CSV files and never had any of these file-locking issues. I hope this is possible with sqlite. Any ideas?

+65
python sqlite
Dec 26 '08 at 6:51
12 answers

You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The first thread, which fetches data from the network, puts this data into the shared queue. Another thread, which owns the database connection, pulls the data off the queue and pushes it into the database.
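A minimal sketch of that pattern, assuming a hypothetical fetch_data() standing in for the real network call, and an invented results table (queue.Queue is the Python 3 name; on the Python of this question's era it was Queue.Queue):

 import queue, sqlite3, threading, time

 work = queue.Queue()

 def fetch_data():
     return "payload"  # stand-in for the real network call

 def fetcher():
     # producer: runs in its own thread and only touches the queue
     work.put(fetch_data())

 def writer():
     # consumer: the only thread that ever owns the sqlite3 connection
     conn = sqlite3.connect("mydatabase.db")
     conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")
     while True:
         data = work.get()  # blocks until a producer puts something
         conn.execute("INSERT INTO results VALUES (?)", (data,))
         conn.commit()

 threading.Thread(target=writer, daemon=True).start()
 while True:
     threading.Thread(target=fetcher).start()
     time.sleep(1)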

+36
Dec 26 '08 at 7:10

Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.

This can be enabled with the optional check_same_thread keyword argument:

 sqlite3.connect(":memory:", check_same_thread=False)
+140
May 24 '10 at 5:03 a.m.

The following was found at mail.python.org.pipermail.1239789:

I found a solution. I don't know why there isn't a word about this option in the Python documentation. We have to add a keyword argument to the connect function, and then we can create cursors from it in different threads. So use:

 sqlite3.connect(":memory:", check_same_thread=False)

This works great for me. Of course, now I have to take care of safe multithreaded access to the db myself. Anyway, thanks to everyone who tried to help.
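For example, a minimal sketch (not the poster's code) of one shared connection guarded by a threading.Lock, with an invented results table:

 import sqlite3, threading

 conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
 conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")
 lock = threading.Lock()  # we are responsible for serializing access now

 def store(data):
     # safe to call from any thread: the lock keeps statements from interleaving
     with lock:
         conn.execute("INSERT INTO results VALUES (?)", (data,))
         conn.commit()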

+14
Apr 05

You should not use threads for this. This is a trivial task for twisted, and it would likely take you significantly further anyway.

Use only one thread, and have the completion of the request trigger an event that does the write.

twisted will take care of the scheduling, callbacks, etc. for you. It will hand you the whole result as a string, or you can run it through a stream processor (I have a twitter API and a friendfeed API that both fire off events to callers while results are still being downloaded).

Depending on what you are doing with your data, you could just stuff the full result into sqlite as it completes, cook it and dump it, or cook it while it is being read and dump it at the end.

I have a very simple app that does something close to what you want on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but it could still be a good base for whatever you are doing.
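For illustration, a rough single-threaded sketch of this approach, using the old twisted.web.client.getPage API that matches the era of this answer (the URL is a placeholder):

 import sqlite3
 from twisted.internet import reactor, task
 from twisted.web.client import getPage

 conn = sqlite3.connect("mydatabase.db")
 conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")

 def store(body):
     # callbacks run in the reactor thread, so only one thread touches sqlite
     conn.execute("INSERT INTO results VALUES (?)", (body,))
     conn.commit()

 def fetch():
     getPage(b"http://example.com/data").addCallback(store)

 task.LoopingCall(fetch).start(1.0)  # fire once per second
 reactor.run()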

+13
Dec 26 '08 at 21:59

Switch to multiprocessing. It is much better, scales well, can go beyond multiple cores by using multiple CPUs, and the interface is the same as the python threading module.
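An illustrative sketch of that same idea with processes, where a single writer process owns the database (fetch_data and the URLs are made up):

 import multiprocessing, sqlite3

 def fetch_data(url):
     return "payload from " + url  # stand-in for the real network call

 def worker(q, url):
     q.put(fetch_data(url))

 def db_writer(q):
     # the only process that ever opens the database
     conn = sqlite3.connect("mydatabase.db")
     conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")
     while True:
         item = q.get()
         if item is None:  # sentinel: no more work
             break
         conn.execute("INSERT INTO results VALUES (?)", (item,))
         conn.commit()

 if __name__ == "__main__":
     q = multiprocessing.Queue()
     writer = multiprocessing.Process(target=db_writer, args=(q,))
     writer.start()
     workers = [multiprocessing.Process(target=worker, args=(q, u))
                for u in ("http://a.example", "http://b.example")]
     for p in workers:
         p.start()
     for p in workers:
         p.join()
     q.put(None)   # tell the writer we are done
     writer.join()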

Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism (a sketch of the pooled setup follows the list below). It will handle everything automatically for you and has many extra features, to quote just a few of them:

  • SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase, and Informix; IBM has also released a DB2 driver. So you will not have to rewrite your application if you decide to move away from SQLite.
  • The Unit of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this, it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces maximum efficiency and transaction safety, and minimizes the chance of deadlocks.
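A sketch of what the pooled setup might look like with SQLAlchemy Core (the table definition is invented, and check_same_thread is passed explicitly here rather than relying on dialect defaults):

 from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Text

 # the engine owns the connection pool; threads check connections out and back in
 engine = create_engine("sqlite:///mydatabase.db",
                        connect_args={"check_same_thread": False})

 metadata = MetaData()
 results = Table("results", metadata,
                 Column("id", Integer, primary_key=True),
                 Column("data", Text))
 metadata.create_all(engine)

 def store(data):
     # engine.begin() checks a connection out of the pool and commits
     # (or rolls back) when the block exits
     with engine.begin() as conn:
         conn.execute(results.insert().values(data=data))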
+10
Dec 26 '08 at 16:51

Or, if you are lazy like me, you can use SQLAlchemy. It will handle the threading for you (using thread-local storage and some connection pooling), and even the way it does so is configurable.

As an added bonus, if/when you realize/decide that using SQLite for any concurrent application is going to be a disaster, you will not have to change your code to use MySQL, Postgres, or anything else. You can just switch over.

+7
Dec 26 '08 at 18:31

I like Eugene's answer. Queues are generally the best way to implement cross-thread communication. For completeness, here are some other options:

  • Close the database connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a no-no, due to the overhead.
  • Do not use child threads. If the once-per-second task is reasonably lightweight, you can get away with doing the fetch and the store, then sleeping until the right moment (a sketch follows this list). This is undesirable because the fetch and store operations could take more than 1 second, and you lose the benefit of multiplexed resources that you get with a multi-threaded approach.
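A sketch of that single-threaded variant (fetch_data again stands in for the real network call, and the results table is invented):

 import sqlite3, time

 def fetch_data():
     return "payload"  # stand-in for the real network call

 conn = sqlite3.connect("mydatabase.db")
 conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")

 while True:
     started = time.time()
     conn.execute("INSERT INTO results VALUES (?)", (fetch_data(),))
     conn.commit()
     # sleep only for whatever remains of the current second
     time.sleep(max(0.0, 1.0 - (time.time() - started)))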
0
Dec 26 '08 at 12:51

You need to design the concurrency for your program. SQLite has clear limitations, and you need to obey them; see the FAQ (also the following question).

0
Dec 26 '08 at 19:32

Scrapy seems like a potential answer to my question. Its homepage describes my exact task. (Although I'm not sure how stable the code is.)

0
Dec 28 '08 at 5:55

I would look at the y_serial Python module for saving data: http://yserial.sourceforge.net

which handles deadlock issues around a single SQLite database. If the demand for concurrency gets heavy, you can easily set up the Farm class of many databases to diffuse the load over stochastic time.

Hope this helps your project ... it should be simple enough to be implemented in 10 minutes.

0
Nov 28 '09 at 20:39

The most likely reason you are getting locked-database errors is that you are not issuing

 conn.commit() 

after a database operation completes. If you do not, your database stays locked for writing and remains that way. The other threads that are waiting to write will then time out after a while (the default is 5 seconds; see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details).

An example of a correct, concurrent insert would be this:

 import threading, sqlite3

 class InsertionThread(threading.Thread):

     def __init__(self, number):
         super(InsertionThread, self).__init__()
         self.number = number

     def run(self):
         conn = sqlite3.connect('yourdb.db', timeout=5)
         conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
         conn.commit()
         for i in range(1000):
             conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
             conn.commit()

 # create as many of these as you wish
 # but be careful to set the timeout value appropriately: thread switching in
 # python takes some time
 for i in range(2):
     t = InsertionThread(i)
     t.start()

If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or need to do something as rare as cross-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!

-1
Nov 08 '13 at