Exporting from / importing to numpy and scipy in SQLite and HDF5 formats

Python seems to have many options for interacting with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables). Has anyone here used these together with numpy arrays or data tables (structured/record arrays)? Which libraries integrate most smoothly with the "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5)?
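For concreteness, here's the kind of round trip I have in mind on the SQLite side, using the standard sqlite3 module and a numpy structured array (the table and column names are just made up for illustration):

```python
import sqlite3
import numpy as np

# Hypothetical schema: a table "readings" with columns t and value.
conn = sqlite3.connect("measurements.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (t REAL, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(float(i), float(i) ** 0.5) for i in range(100)])
conn.commit()

# Pull the rows back out into a structured (record) array.
rows = conn.execute("SELECT t, value FROM readings").fetchall()
data = np.array(rows, dtype=[("t", "f8"), ("value", "f8")])
conn.close()

print(data["value"].mean())
```

Is this the idiomatic way to do it, or do libraries like atpy handle the conversion more cleanly?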

python numpy scipy sqlite hdf5
1 answer

Most of this depends on your use case.

I have much more experience with the various HDF5-based methods than with traditional relational databases, so I can't comment much on the SQLite libraries for Python...

As far as h5py vs. pyTables goes, they both provide very easy access through numpy arrays, but they target very different use cases.

If you have n-dimensional data and you want to quickly grab an arbitrary index-based slice of it, then h5py is much simpler to use. If you have table-like data and you want to query it, then pyTables is a much better option.
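To illustrate the first case, a minimal sketch of the h5py access pattern I mean (the file and dataset names are arbitrary):

```python
import numpy as np
import h5py

# Write an n-dimensional array to an HDF5 file...
arr = np.random.rand(100, 200, 300)
with h5py.File("data.h5", "w") as f:
    f.create_dataset("volume", data=arr)

# ...then read back an arbitrary index-based slice; only that
# slice is actually pulled off disk.
with h5py.File("data.h5", "r") as f:
    sub = f["volume"][10:20, ::4, 150]

print(sub.shape)  # (10, 50)
```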

h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is very nice if you intend to regularly access your HDF file from other languages (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you'll need to spend more time tweaking things.

pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.

To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. Usually I want to access only a small subset of the entire dataset. h5py makes this very simple and does a reasonably good job of auto-guessing a sensible chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than with a plain memmapped file. (The emphasis is on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are contiguous on disk.)
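As a rough sketch of what that looks like (the shapes and names here are invented, not my actual data):

```python
import numpy as np
import h5py

with h5py.File("big.h5", "w") as f:
    # chunks=True asks h5py to auto-guess a reasonable chunk size.
    dset = f.create_dataset("cube", shape=(2000, 2000, 500),
                            dtype="f4", chunks=True)
    # Write a slab at a time; the whole array never has to sit in memory.
    dset[0] = np.random.rand(2000, 500).astype("f4")

with h5py.File("big.h5", "r") as f:
    # Only the chunks that overlap this sub-block are read from disk;
    # regions that were never written just come back as the fill value.
    block = f["cube"][100:110, 500:510, 200:210]
    print(block.shape)  # (10, 10, 10)
```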

As a counter-example, my wife collects data from a wide range of sensors that sample at minute-to-second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and the speed at which a large (indexed) chunk of data can be read into memory.)
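A stripped-down sketch of that pattern (the schema here is invented, not her actual one):

```python
import tables as tb

# Hypothetical sensor-log schema.
class Reading(tb.IsDescription):
    timestamp = tb.Float64Col()
    sensor_id = tb.Int32Col()
    value = tb.Float32Col()

with tb.open_file("sensors.h5", "w") as f:
    table = f.create_table("/", "readings", Reading)
    row = table.row
    for i in range(10000):
        row["timestamp"] = 1.3e9 + 60.0 * i
        row["sensor_id"] = i % 10
        row["value"] = float(i % 777)
        row.append()
    table.flush()
    table.cols.sensor_id.create_index()  # speeds up queries on this column

    # "In-kernel" query: evaluated chunk by chunk, so the whole table
    # never has to be loaded into memory.
    hits = [r["value"] for r in
            table.where("(sensor_id == 3) & (value > 500)")]
```

That where() call is the kind of thing that separates pyTables from h5py for table-like data.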

Oct. 25 '11 at 2:48


