HDFStore: table.select and RAM usage

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, Python 2.7, linux64.

In this first case, RAM usage matches the size of the chunk:

    with pd.get_store("train.h5", 'r') as train:
        for chunk in train.select('train', chunksize=50):
            pass

In this second case, it seems that the entire table is loaded into RAM:

    r = random.choice(400000, size=40, replace=False)
    train.select('train', pd.Term("index", r))

In this third case, RAM usage is consistent with a chunk of equivalent size:

    r = random.choice(400000, size=30, replace=False)
    train.select('train', pd.Term("index", r))

I am puzzled why going from 30 to 40 random rows causes such a sharp increase in RAM usage.

Note that the table was indexed at creation so that index = range(nrows(table)), using the following code:

    def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
        max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

        with pd.get_store(storefile, 'w') as store:
            for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep,
                                                    chunksize=chunksize, dtype=dict(dtypes0))):
                chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
                store.append(table_name, chunk, min_itemsize={'values': max_len})
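For reference, here is a quick sanity check (a hypothetical snippet, not part of the conversion code) that the stored index really is the contiguous range assigned above:

    import pandas as pd

    # Hypothetical check: the first rows of the stored table should carry the
    # contiguous integer index assigned in txtfile2hdfstore.
    with pd.get_store('train.h5', 'r') as store:
        head = store.select('train', start=0, stop=5)
        assert list(head.index) == range(5)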

Thanks for your help.

EDIT: in answer to Zelazny7

Here is the file I used to write Train.csv to train.h5. I wrote it using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type.

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction import DictVectorizer


    def object_max_len(x):
        if x.dtype != 'object':
            return
        else:
            return len(max(x.fillna(''), key=lambda x: len(str(x))))


    def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
        max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(
            object_max_len).max()
        dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes

        for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
            max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
            for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
                if (k[0] != k[1]) and (k[1] == 'object'):
                    dtypes0[i] = k[1]

        # as of pandas 0.11, NaN requires a float64 dtype
        dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
        return max_len, dtypes0


    def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
        max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

        with pd.get_store(storefile, 'w') as store:
            for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep,
                                                    chunksize=chunksize, dtype=dict(dtypes0))):
                chunk.index = range(chunksize * i, chunksize * (i + 1))[:chunk.shape[0]]
                store.append(table_name, chunk, min_itemsize={'values': max_len})

Used as

    txtfile2hdfstore('Train.csv', 'train.h5', 'train', sep=',')
1 answer

This is a known issue, see the link here: https://github.com/pydata/pandas/pull/2755

Essentially, the query is converted into a numexpr expression for evaluation. There is an issue where I cannot pass a lot of or conditions to numexpr (it depends on the total length of the generated expression).
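To make the size of the problem concrete, here is a rough sketch (my own illustration, not the actual pandas internals) of what such a selection has to express: one equality term per requested row, all or-ed together, so the expression length grows linearly with the number of rows requested.

    import numpy as np

    r = np.random.choice(400000, size=40, replace=False)

    # Illustration only: roughly the shape of the condition behind a Term-based
    # selection -- one equality test per requested index value, chained with OR.
    expr = " | ".join("(index == %d)" % i for i in r)
    print len(r), "or-ed conditions,", len(expr), "characters"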

So what I do is limit the expression that is passed to numexpr: if it would need too many or conditions, the query is performed as a filter rather than an in-kernel selection. That basically means the table is read and then reindexed.

This is on my list of enhancements: https://github.com/pydata/pandas/issues/2391 (item 17).

As a workaround, just split your queries up into multiple ones and concatenate the results. This should be much faster and use a constant amount of memory.
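A minimal sketch of that workaround (the helper name and batch size are my own choices; the query syntax mirrors the one used in the question). The idea is that each batch stays small enough for the sub-query to run in-kernel:

    import numpy as np
    import pandas as pd

    def select_rows(store, key, rows, batch_size=25):
        # Split the requested row numbers into small batches so each query stays
        # under the numexpr condition limit, then concatenate the pieces.
        n_batches = int(np.ceil(len(rows) / float(batch_size)))
        pieces = []
        for batch in np.array_split(np.sort(rows), n_batches):
            pieces.append(store.select(key, pd.Term("index", list(batch))))
        return pd.concat(pieces)

    r = np.random.choice(400000, size=40, replace=False)
    with pd.get_store("train.h5", 'r') as train:
        sample = select_rows(train, 'train', r)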
