What simple database is used with both Python and Matlab?

I need to manipulate a lot of numerical/textual data, say, about 10 billion records, which could in theory be organized as 1000 tables of 10000 × 1000 entries each. Most calculations need to be done on a small subset of the data each time (specific rows or columns), so I don't need all the data at once.

Therefore, I want to store the data in some kind of database, so that I can easily search it, retrieve the rows/columns that meet certain criteria, do some calculations and update the database. The database should be accessible from both Python and Matlab; I use Python mainly to create the source data and enter it into the database, and Matlab to process the data.

The whole project runs on Windows 7. What is the best and simplest database I can use for this purpose? I have no experience with databases.

+4
4 answers

I would recommend SQLite. The standard Python installation already includes bindings for it (the sqlite3 module).

No separate SQLite installation is needed on Windows: the Python sqlite3 module ships with the SQLite engine built in.

To create a database, you can do something like this (from the sqlite3 documentation):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Save (commit) the changes
conn.commit()

# Close the cursor and the connection when done
c.close()
conn.close()
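Once the table exists, the asker's read-compute-update workflow maps directly onto plain SQL. A minimal sketch, continuing the stocks example above (the price correction is purely hypothetical):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Retrieve only the rows that meet a criterion (no need to load everything)
c.execute("SELECT symbol, qty, price FROM stocks WHERE trans = ?", ('BUY',))
rows = c.fetchall()

# Do some calculation on that subset, e.g. the total position value
total = sum(qty * price for _, qty, price in rows)
print('Total value of BUY transactions:', total)

# Write a result back, updating the database in place
c.execute("UPDATE stocks SET price = ? WHERE symbol = ?", (36.00, 'RHAT'))
conn.commit()
conn.close()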

And to read the data into Matlab, you can use mksqlite.

For more information, you can check: http://labrosa.ee.columbia.edu/millionsong/pages/sqlite-interfaces-python-and-matlab

+7

IMO, just use the file system with a file format that you can read/write from both MATLAB and Python. Databases usually imply a relational model (NoSQL aside), which would only add complexity here.

If you are more inclined towards MATLAB, you can manipulate MAT-files directly from SciPy using scipy.io.loadmat / scipy.io.savemat. This is the native MATLAB format for storing data with the save / load functions.
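For illustration, here is a minimal sketch of that round trip from the Python side (file and variable names are just placeholders):

import numpy as np
from scipy.io import loadmat, savemat

# Write one chunk of data to a MAT-file that MATLAB can open with load()
savemat('chunk_001.mat', {'prices': np.random.rand(10000, 1000)})

# Read it back into Python (in MATLAB: data = load('chunk_001.mat'))
data = loadmat('chunk_001.mat')
print(data['prices'].shape)  # (10000, 1000)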

Unless, of course, you really do need a database, in which case ignore my answer :)

+3

SQLite is easy to set up, but I have had no problems with MySQL either. Connectors are available and work quite smoothly.

http://www.mathworks.com/matlabcentral/fileexchange/8663-mysql-database-connector

I have a similar project running, where I use Matlab for extraction and analysis and Ruby on Rails to publish a lot of stock market data. It handles very large datasets, and this solution seems to work well. Historically, SQLite3 has not performed as well as MySQL or PostgreSQL on large datasets, so I recommend switching.
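On the Python side, a MySQL driver such as mysql-connector-python follows the same DB-API pattern as sqlite3. A minimal sketch, where the host, credentials and quotes table are all hypothetical:

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host='localhost', user='me', password='secret', database='marketdata')
cur = conn.cursor()

# Fetch only the rows of interest; %s is the MySQL parameter placeholder
cur.execute("SELECT symbol, price FROM quotes WHERE price > %s", (100.0,))
for symbol, price in cur:
    print(symbol, price)

cur.close()
conn.close()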

+2

PostgreSQL Benefits

If you need to handle more complex data types (e.g. arrays), it is reasonable, IMHO, to use PostgreSQL. On the one hand, it allows you to store much richer types than SQLite. On the other hand (unlike some relational databases such as MySQL), PostgreSQL is fully ACID. In short, PostgreSQL is a good choice for highly structured data that comes in tabular form and requires more complex data types, such as arrays. Last, but not least, PostgreSQL is free, open-source software developed by an international team drawn from several companies as well as individual contributors.

Python for PostgreSQL

Regarding access to PostgreSQL from Python, there are several Python drivers for PostgreSQL, for example Psycopg2 or PyGreSQL (lists of such drivers can be found here: https://wiki.postgresql.org/wiki/Python).
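For example, with Psycopg2 a Python list maps directly onto a PostgreSQL array column, which is exactly the kind of richer type mentioned above. A minimal sketch (database name, credentials and table are hypothetical):

import psycopg2  # pip install psycopg2

conn = psycopg2.connect(dbname='mydb', user='me', password='secret', host='localhost')
cur = conn.cursor()

# An array-typed column, something SQLite cannot store natively
cur.execute("CREATE TABLE IF NOT EXISTS samples "
            "(id serial PRIMARY KEY, vals double precision[])")

# psycopg2 adapts Python lists to PostgreSQL arrays automatically
cur.execute("INSERT INTO samples (vals) VALUES (%s)", ([1.5, 2.5, 3.5],))
conn.commit()

cur.execute("SELECT vals FROM samples ORDER BY id DESC LIMIT 1")
print(cur.fetchone()[0])  # [1.5, 2.5, 3.5]

conn.close()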

Matlab PostgreSQL Connectors and Their Performance

As for the corresponding connectors for Matlab, there are several solutions.

First of all, you can use the standard Matlab Database Toolbox, which works with PostgreSQL through a direct JDBC connection. But the Matlab Database Toolbox has some hidden limitations concerning performance and the amount and type of data to be processed. For example, it is practically impossible to use it for arrays or for substantially large amounts of data (about 1 GB or more).

You can also use JDBC directly from Matlab (for arrays you can use, for example, the dbarray package). But IMHO this is rather slow and often runs out of Java heap memory for big data (and simply increasing the Java heap size is not a panacea). So these methods are only good if you need to process a relatively small amount of data and the execution time of this part is not critical.

Other solutions are based on libpq. For example, there is a free mexPostgres package written in C++. This library parses data based on its textual representation (via the PQgetvalue function from libpq) and supports only a very limited list of data types (in fact, scalar numbers and logicals, times, dates, timestamps and intervals, as well as strings; arrays are again out of scope).

And finally, there is another, commercial solution: a high-performance PostgreSQL client library for Matlab written 100% in C and based on libpq, called PgMex. The main (but not the only) difference between PgMex and mexPostgres (despite the fact that both libraries are based on libpq) is that PgMex provides binary data transfer between Matlab and PostgreSQL, without any parsing of text. At the same time, everything is done in a Matlab-friendly and native way (in the form of matrices, multidimensional arrays, structures and arbitrary other Matlab formats). In terms of performance, this can be seen in the following figures comparing data insertion for the Matlab Database Toolbox and PgMex (as for data retrieval, preliminary results show that PgMex is about 3.5 times faster than the Matlab Database Toolbox for the simplest case of scalar numeric data):

[Two figures: insertion performance for the case of scalar numeric data and for the case of arrays]

Here the performance of fastinsert and datainsert from the Matlab Database Toolbox is compared with that of batchParamExec from PgMex (see https://pgmex.alliedtesting.com/#batchparamexec for details). The first figure refers to scalar numeric data, the second to arrays. The endpoint of each graph corresponds to the maximum amount of data the respective method can transfer into the database without any error. Amounts of data exceeding this maximum (specific to each method) cause an "out of Java heap memory" problem (the Java heap size for each experiment is indicated at the top of each figure). For further details of the experiments, see the following paper with complete benchmarking results for data insertion.

+2
