Storing and reloading large multidimensional datasets in Python

I am going to run a large number of simulations that will produce a large amount of data, which needs to be saved and accessed later. The output of my simulation program is written to text files (one per simulation). I plan to write a Python program that reads these text files and stores the data in a format more convenient for later analysis. After quite a lot of searching I think I am suffering from information overload, so I am asking Stack Overflow for some tips. Here are the details:

My data will basically take the form of a multidimensional array, where each record will look something like this:

data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
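To make the shape concrete, here is a minimal sketch of one such record as a plain Python dictionary keyed by a tuple. The key values ("siteA", "modelX" and so on) and the result numbers are made up purely for illustration.

```python
# One record: a 5-part key mapped to a list of 12 float results.
# All key values and numbers below are placeholders, not real data.
data = {}

key = ("siteA", "modelX", "caseP", "hour00", 42)  # (stringArg1..4, intArg1)
data[key] = [0.1 * i for i in range(12)]          # floatResult01..12

# Python packs comma-separated subscripts into a tuple,
# so this lookup is equivalent to data[key].
results = data["siteA", "modelX", "caseP", "hour00", 42]
print(len(results))  # number of float results stored for this key
```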

Each argument has approximately the following number of possible values:

stringArg1: 50

stringArg2: 20

stringArg3: 6

stringArg4: 24

intArg1: 10000

Note, however, that the data set will be sparse. For example, for a given stringArg1 value, only about 16 stringArg2 values will be filled. In addition, for a given (stringArg1, stringArg2) combination, approximately 5000 intArg1 values will be filled. The third and fourth string arguments are always completely filled.

So, with these numbers, my array will have approximately 50 * 16 * 6 * 24 * 5000 = 576,000,000 result lists.
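As a rough check of what that means for storage, the arithmetic can be sketched as below. The 8 bytes per result is my assumption (double-precision floats); nothing above fixes the precision. Under that assumption the raw results alone come to roughly 55 GB, before any keys, indexes, or compression.

```python
# Back-of-the-envelope size estimate for the sparse data set.
# Assumption (not stated in the question): each result is an 8-byte float.
n_arg1, n_arg2_filled, n_arg3, n_arg4, n_int_filled = 50, 16, 6, 24, 5000
floats_per_record = 12
bytes_per_float = 8

# Records that are actually filled, and their raw payload size.
n_records = n_arg1 * n_arg2_filled * n_arg3 * n_arg4 * n_int_filled
raw_bytes = n_records * floats_per_record * bytes_per_float

# Number of cells if every combination of possible values were present.
n_dense = 50 * 20 * 6 * 24 * 10000

print(f"{n_records:,} filled records out of {n_dense:,} possible")
print(f"{raw_bytes / 1e9:.1f} GB of raw results")
```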

I am looking for the best way to store this array so that I can reopen it later to add more data, update existing data, or query existing data for analysis. So far I have looked at three different approaches:

  • relational database

  • PyTables

  • Python (pickle)
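To illustrate the first option, here is a minimal sketch using the standard-library sqlite3 module. The table name, the column names (s1..s4, n1, r01..r12) and the key values are hypothetical, and SQLite is only a stand-in for whichever relational database would actually be used. The five arguments form a composite primary key, which covers the add, update and query cases.

```python
import sqlite3

# In-memory database for the demo; a file path would make it persistent.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: the five arguments form a composite primary key,
# and the 12 results are stored as columns r01..r12.
result_cols = [f"r{i:02d}" for i in range(1, 13)]
conn.execute(
    "CREATE TABLE results ("
    "s1 TEXT, s2 TEXT, s3 TEXT, s4 TEXT, n1 INTEGER, "
    + ", ".join(f"{c} REAL" for c in result_cols)
    + ", PRIMARY KEY (s1, s2, s3, s4, n1))"
)

def upsert(conn, key, values):
    """Insert a record, or overwrite it if the key already exists."""
    placeholders = ", ".join("?" * (len(key) + len(values)))
    conn.execute(
        f"INSERT OR REPLACE INTO results VALUES ({placeholders})",
        (*key, *values),
    )

key = ("siteA", "modelX", "caseP", "hour00", 42)      # placeholder values
upsert(conn, key, [float(i) for i in range(12)])      # add a record
upsert(conn, key, [float(i) * 2 for i in range(12)])  # update the same key

# Query: first and last result of every record for one stringArg1 value.
rows = conn.execute(
    "SELECT n1, r01, r12 FROM results WHERE s1 = ?", ("siteA",)
).fetchall()
print(rows)
```

Whether this scales comfortably to several hundred million rows is a separate question that the sketch does not answer.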

, , (stringArg1, stringArg2, stringArg3, stringArg4, intArg1) , Python . (, ) , , . , , . , 2x2 = [[100, 200], [300, 400]], , [0] [1]. (0,0) (0,1) (1,0) (1,1) . .

, , PyTables, . , . stringArg1. . , stringArg2, - -...

( , ViTables ). , PyTables, , . , , .

, - , .

, , . . , . , .

500 ? ( Blosc), , . ; : -)

, 6 ?

. 1-5 , , , .

, 3- 4- , , 6- 3 (string1, string2, int1), string3 string4 .

, , , () . , , , Numpy Numpy. Numpy

. . NumPy .

I have used NumPy many times to process simulation data, and it provides many useful tools, including easy file storage and access.

I hope you find something useful in the documentation, which is very easy to read:

Documentation with examples.
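As a rough illustration of the kind of file storage I mean, here is a minimal sketch with np.save and np.load. The layout (one dense block per pair of the first two string arguments, with axes for the third and fourth string arguments, the integer argument and the 12 results), the file name, and the small integer axis are all my own assumptions for the demo, not something the question prescribes.

```python
import os
import tempfile

import numpy as np

# Hypothetical layout: one dense block per (stringArg1, stringArg2) pair,
# with axes (stringArg3, stringArg4, intArg1, result). A small intArg1
# axis is used here so the demo stays tiny.
block = np.zeros((6, 24, 100, 12), dtype=np.float64)
block[2, 5, 42] = np.arange(12)  # fill one record with placeholder values

path = os.path.join(tempfile.mkdtemp(), "siteA_modelX.npy")
np.save(path, block)

# mmap_mode="r" maps the file instead of reading it all into memory,
# so only the slices that are actually indexed get loaded.
loaded = np.load(path, mmap_mode="r")
record = np.array(loaded[2, 5, 42])  # copy one record out of the map
print(record)
```

Sparsity is not handled here: unfilled intArg1 slots would simply stay zero, so a real layout would need some way to mark missing records.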
