How to save a large array so that it takes up less memory in Python?

I am new to Python. I have a large array a with a shape such as (43200, 4000), and I need to save it for further processing. When I try to save it using np.savetxt, the txt file is enormous, and my program runs out of memory since I need to process 5 files of the same size. Is there a way to save huge arrays so that they take up less space?

Thanks.

+8
python numpy
2 answers

You can use PyTables to create a Hierarchical Data Format (HDF) file to store your data. This gives you some interesting in-memory options that link the object you are working with to the file it is saved in.

Here's another StackOverflow question that demonstrates how to do this: How to store a multidimensional NumPy array in PyTables.
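As a rough sketch of that approach (the file name and array name here are just illustrative), PyTables can write the array to disk once and then read back only the slices you need:

    import numpy as np
    import tables

    a = np.ones((43200, 4000))

    # Write the whole array to an HDF5 file.
    with tables.open_file("data.h5", mode="w") as f:
        f.create_array(f.root, "a", a)

    # Later, read back only a slice instead of the whole array.
    with tables.open_file("data.h5", mode="r") as f:
        first_rows = f.root.a[:10]  # pulls just these rows from disk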

If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, for example:

    import pandas
    import numpy as np

    a = np.ones((43200, 4000))  # Not recommended.

    x = pandas.HDFStore("some_file.hdf")
    x.append("a", pandas.DataFrame(a))  # <-- This will take a while.
    x.close()

    # Then later on...
    my_data = pandas.HDFStore("some_file.hdf")  # might also take a while
    usable_a_copy = my_data["a"]  # Be careful of the way changes to
                                  # `usable_a_copy` affect the saved data.
    copy_as_nparray = usable_a_copy.values

With files of this size, you might also consider whether your application can be carried out with a parallel algorithm, working on only subsets of the large arrays rather than consuming each entire array before proceeding.
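For instance, here is a minimal sketch of that idea, assuming the array was appended in table format as in the snippet above; the chunk size is arbitrary:

    import pandas

    with pandas.HDFStore("some_file.hdf") as store:
        nrows = store.get_storer("a").nrows  # number of saved rows
        chunk = 10000
        for start in range(0, nrows, chunk):
            piece = store.select("a", start=start, stop=start + chunk)
            # ... process piece.values without holding the full array ...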

+4

Saving data in a text file is extremely inefficient. NumPy has built-in save and savez / savez_compressed functions, which are much better suited to storing large arrays.
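A minimal sketch of both (the file names are illustrative):

    import numpy as np

    a = np.ones((43200, 4000))

    np.save("a.npy", a)                # fast, uncompressed binary
    np.savez_compressed("a.npz", a=a)  # zlib-compressed archive

    a_back = np.load("a.npy")
    a_also = np.load("a.npz")["a"]

The binary .npy file is typically far smaller than the equivalent savetxt output, since values are stored as raw bytes rather than decimal text.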

Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or PyTables), which allows you to store large data sets without loading them entirely into memory.
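A short h5py sketch of that pattern (the dataset name and chunk size are illustrative): the data is written once, then read back one slice at a time, so the full array never has to sit in memory.

    import h5py
    import numpy as np

    with h5py.File("data.h5", "w") as f:
        f.create_dataset("a", data=np.ones((43200, 4000)),
                         compression="gzip")

    with h5py.File("data.h5", "r") as f:
        dset = f["a"]
        for i in range(0, dset.shape[0], 10000):
            block = dset[i:i + 10000]  # only this slice is in memory
            # ... process block ...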

+10
