Storing a lot of data in Python

Let me start with a short introduction to my problem. I am writing a Python program for the post-processing of various physical simulations. Each simulation can produce up to 100 GB of output, containing different quantities (for example positions, fields, densities, ...) at different time steps. I would like to have access to all of this data at once, which is impossible because I do not have enough memory in my system. So I usually read a file, do some operations on it, and free the memory; then I read the next data, do some operations, and free the memory again.

Now my problem: working this way, I end up reading the same data more than once, and that costs a lot of time. I would like to read the data only once and keep it available for fast access. Do you know of a method for storing large amounts of data that is really fast, or that does not need much space?

I just wrote a method that is about ten times faster than a normal open/read. But it relies on cat (the Linux command), which is a really dirty hack, and I would like to get it out of my script.

Can databases be used to store this data and retrieve it faster than normal reading? (Sorry for the question, but I am not a computer scientist and do not know much about databases.)

EDIT:

My code looks something like this (just an example):

    import os, string
    from numpy import array, reshape

    out = string.split(os.popen("cat " + base + "phs/phs01_00023_" + time).read())
    # if I want this data as an array, I then use array() and, if needed, reshape()
    out = array(out)
    out = reshape(out, newshape)   # newshape depends on the data layout

Normally I would use the numpy.loadtxt method, which I expect to take about the same time as normal reading:

    f = open('filename')
    f.read()
    ...

I think loadtxt just uses the regular reading methods with some extra lines of code.
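To be concrete, by reading with loadtxt I mean a minimal sketch like the following (the placeholder values for base, time and the reshape shape are only examples):

    import numpy as np

    base, time = "./", "0001"          # placeholders, as in the snippet above

    # loadtxt parses the whole text file into one array in a single call;
    # under the hood it still reads and splits the file line by line.
    out = np.loadtxt(base + "phs/phs01_00023_" + time)
    out = out.reshape(-1, 3)           # the (-1, 3) shape is just an example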

I know that there are better ways to read data, but everything I found so far was very slow. I will now try mmap and hope to get better performance.

+6
python
3 answers

If you are running a 64-bit operating system, you can use the mmap module to map the entire file into your address space. Reading random bits of the data can then be much faster, since the OS takes care of the access patterns. Note that you do not actually need 100 GB of RAM for this, because the OS manages it all through virtual memory.

I did this with a 30 GB file (a dump of the Wikipedia XML articles) on 64-bit FreeBSD 8 with very good results.
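A minimal sketch of what that can look like, assuming a large text file on disk (the file name and the byte range are made up for illustration):

    import mmap

    # Map an existing (possibly very large) file read-only; the OS pages data
    # in and out as it is touched, so no huge allocation happens up front.
    with open("phs/phs01_00023_0001", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        chunk = mm[1024:2048]        # random access: slice out any byte range
        first_line = mm.readline()   # sequential access works as well

        mm.close()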

+7

I would try using HDF5. There are two commonly used Python interfaces, h5py and PyTables. Although the latter seems more common, I prefer the former.
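To give a rough idea of the h5py side (the file names, the dataset name and the use of loadtxt for the one-time conversion are just assumptions): you convert the simulation output to one HDF5 file once, and afterwards you can slice any dataset without loading the whole thing into memory.

    import numpy as np
    import h5py

    # One-time conversion: store each quantity/time step as a dataset.
    with h5py.File("simulation.h5", "w") as f:
        data = np.loadtxt("phs/phs01_00023_0001")
        f.create_dataset("positions/t00023", data=data, compression="gzip")

    # Later: open the file and read only the slice you need.
    with h5py.File("simulation.h5", "r") as f:
        block = f["positions/t00023"][:1000]   # reads just these rows from disk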

+7

If you work with large datasets, Python alone may not be the best choice. If you want to use a database such as MySQL or Postgres, you should try SQLAlchemy. It makes it easy to work with potentially large datasets through small Python objects. For example, with a class definition like this:

    from datetime import datetime

    from sqlalchemy import Column, DateTime, Enum, ForeignKey, Integer, \
        MetaData, PickleType, String, Text, Table, LargeBinary
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import column_property, deferred, object_session, \
        relation, backref

    SqlaBaseClass = declarative_base()

    class MyDataObject(SqlaBaseClass):
        __tablename__ = 'datarows'

        eltid   = Column(Integer, primary_key=True)
        name    = Column(String(50, convert_unicode=True), nullable=False,
                         unique=True, index=True)
        created = Column(DateTime)
        updated = Column(DateTime, default=datetime.today)

        mylargecontent = deferred(Column(LargeBinary))

        def __init__(self, name):
            self.name = name
            self.created = datetime.today()

        def __repr__(self):
            return "<MyDataObject name='%s'>" % (self.name,)

Then you can easily access all rows with small data objects:

    # set up database connection; open dbsession; ...

    for elt in dbsession.query(MyDataObject).all():
        print elt.eltid                     # does not access mylargecontent
        if something(elt):
            process(elt.mylargecontent)     # now the large binary is pulled
                                            # from the db server, on demand
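For completeness, the connection setup that the comment above glosses over might look roughly like this (the SQLite URL is only an example; any backend supported by SQLAlchemy will do):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # Create the schema once and open a session; swap the URL for MySQL/Postgres.
    engine = create_engine("sqlite:///simdata.db")
    SqlaBaseClass.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)
    dbsession = Session()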

I think the point is that you can add as many fields to your data as you want, together with whatever indexes are needed to speed up your searches. And most importantly, when you work with MyDataObject, you can make potentially large fields deferred, so that they are only loaded when you actually need them.

0
