I am going to run a large number of simulations that produce a large amount of data that needs to be saved and accessed later. The output from my simulation program is written to text files (one per simulation). I plan to write a Python program that reads these text files and then saves the data in a format more convenient for later analysis. After quite a lot of searching, I think I am suffering from information overload, so I am putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array, where each record will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has approximately the following number of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10000
Note, however, that the data set will be sparse. For example, for a given stringArg1 value, only about 16 stringArg2 values will be filled in. Likewise, for a given (stringArg1, stringArg2) combination, only about 5000 intArg1 values will be filled in. The third and fourth string arguments are always fully populated.
So, with these numbers, my array will have approximately 50 * 16 * 6 * 24 * 5000 = 576,000,000 result lists.
I am looking for the best way to store this array so that I can reopen it later to add more data, update existing data, or query it for analysis. So far I have looked into three different approaches:
a relational database (sketched just below)
PyTables
a Python dictionary with tuples as keys (saved and reloaded with pickle)
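To make the relational option concrete, here is roughly the kind of schema I have in mind, as a minimal sqlite3 sketch (the file name, table name, and column layout are just placeholders I made up):

    import sqlite3

    conn = sqlite3.connect("simulations.db")  # placeholder file name
    conn.execute("""
        CREATE TABLE IF NOT EXISTS results (
            stringArg1 TEXT, stringArg2 TEXT, stringArg3 TEXT, stringArg4 TEXT,
            intArg1    INTEGER,
            floatResult01 REAL,
            -- ... columns floatResult02 through floatResult11 ...
            floatResult12 REAL,
            PRIMARY KEY (stringArg1, stringArg2, stringArg3, stringArg4, intArg1)
        )
    """)
    conn.commit()
    conn.close()

Note that every record's arguments become ordinary columns here, which is exactly the issue I describe next.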
One issue I run into with all three approaches is that I always end up storing every (stringArg1, stringArg2, stringArg3, stringArg4, intArg1) tuple combination, either as a field, as a table name, or as the actual dictionary keys in Python. From my (possibly naive) point of view, this seems unnecessary. If these were all integer arguments, they would simply form the address of each entry in the array, and there would be no need to store all the possible address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]], I would retrieve values by asking for the value at address [0][1]. I would not need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I am hoping to find a way around this.
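For concreteness, the dictionary approach I am describing would look something like this (the key values and file name are made up):

    import pickle

    # Only combinations that actually occur become keys, so the
    # sparseness is handled for free.
    data = {}
    data[("s1", "s2", "s3", "s4", 42)] = [0.1, 0.2]  # ... up to floatResult12

    # Save to disk now, reload later.
    with open("data.pkl", "wb") as f:
        pickle.dump(data, f)

    with open("data.pkl", "rb") as f:
        data = pickle.load(f)

This handles the sparseness nicely, but as noted above, every address tuple still ends up stored in the file as a key.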
What I would love to be able to do, with PyTables for example, is lay the data out hierarchically. It seems to me there is a natural hierarchy here: the top level would hold a group for each stringArg1 value. Inside each of those would be a group for each stringArg2 value, and so on down through the arguments...

(As a side benefit, a browser such as ViTables would make inspecting the data easy.) From what I have read about PyTables so far, though, I have not found a clean way to set this up, nor am I sure how well it copes with data this sparse. If I have misunderstood something here, please correct me.
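For reference, this is the kind of hierarchical layout I have been picturing in PyTables; the group and table names are invented, and I am not at all sure this is the intended way to use the library:

    import tables

    # One row per intArg1 value, each holding a vector of 12 results.
    class Result(tables.IsDescription):
        intArg1 = tables.Int32Col()
        results = tables.Float64Col(shape=(12,))

    h5file = tables.open_file("simulations.h5", mode="w")  # placeholder name

    # A group per stringArg1 value, a subgroup per stringArg2 value, etc.
    g1 = h5file.create_group("/", "arg1_value_A")
    g2 = h5file.create_group(g1, "arg2_value_B")
    table = h5file.create_table(g2, "results", Result)

    row = table.row
    row["intArg1"] = 42
    row["results"] = [0.0] * 12
    row.append()
    table.flush()
    h5file.close()

ViTables could then browse the resulting HDF5 file directly.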
Alternatively, if there is some other storage solution better suited to this kind of data, I would be glad to hear about it.
As I said, I am suffering from information overload. I apologize if this question is too open-ended; any advice, pointers, or examples would be greatly appreciated. Thanks in advance.