Pandas memoization

I have long calculations that I repeat many times, so I would like to use memoization (packages such as jug and joblib) together with Pandas. The question is whether these packages memoize well when Pandas DataFrames are passed as function arguments.

Has anyone tried it? Is there any other recommended package / way to do this?

2 answers

Author of jug here: jug works fine with DataFrames. I just tried the following and it works:

    from jug import TaskGenerator
    import pandas as pd
    import numpy as np

    @TaskGenerator
    def gendata():
        return pd.DataFrame(np.arange(343440).reshape((10, -1)))

    @TaskGenerator
    def compute(x):
        return x.mean()

    y = compute(gendata())

This is not as efficient as it could be, because jug simply pickles the DataFrame internally. It compresses the pickle on the fly, so memory usage is not terrible, but it is slower than it could be.

I would be open to a change that saves DataFrames as a special case, as jug currently does for numpy arrays: https://github.com/luispedro/jug/blob/master/jug/backends/file_store.py#L102
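To illustrate the storage approach described in this answer, here is a minimal sketch of pickling a DataFrame and compressing it on the fly. It is not jug's actual code: the helper names `dump_compressed` and `load_compressed` are hypothetical, and `zlib` is only an assumed choice of compressor.

```python
import pickle
import zlib

import numpy as np
import pandas as pd


def dump_compressed(obj, level=6):
    """Pickle obj and compress the bytes (hypothetical helper, not jug's API)."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return zlib.compress(raw, level)


def load_compressed(blob):
    """Reverse dump_compressed: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))


# Same shape of data as in the answer's example
df = pd.DataFrame(np.arange(343440).reshape((10, -1)))

blob = dump_compressed(df)
restored = load_compressed(blob)
```

The round trip is generic, which is why it works for any picklable object, but it pays for a full pickle plus a compression pass on every save. A format-specific writer for the underlying arrays could avoid some of that work.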


I use this basic memoized decorator: http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize

DataFrames are hashable (at least in the pandas version I used), so it should work fine. Here is an example:

    In [2]: func = lambda df: df.apply(np.fft.fft)

    In [3]: memoized_func = memoized(func)

    In [4]: df = DataFrame(np.random.randn(1000, 1000))

    In [5]: %timeit func(df)
    10 loops, best of 3: 124 ms per loop

    In [9]: %timeit memoized_func(df)
    1000000 loops, best of 3: 1.46 us per loop

Looks nice.
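One caveat for anyone trying this today: recent pandas versions raise a TypeError when you call hash() on a DataFrame, so a decorator that keys its cache on the argument's hash may silently skip caching or fail. A possible workaround, sketched below, is to key the cache on the frame's contents instead. The names `frame_key` and `memoize_frame` are hypothetical, and the sketch assumes a function that takes a single DataFrame argument.

```python
import functools
import hashlib

import numpy as np
import pandas as pd


def frame_key(df):
    """Build a cache key from the frame's values, index and column labels."""
    # hash_pandas_object returns one uint64 hash per row (index included)
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    digest = hashlib.sha1(row_hashes.tobytes())
    # Row hashes do not cover column labels, so mix them in separately
    digest.update(repr(list(df.columns)).encode("utf8"))
    return digest.hexdigest()


def memoize_frame(func):
    """Hypothetical decorator: cache func(df) keyed by the frame's contents."""
    cache = {}

    @functools.wraps(func)
    def wrapper(df):
        key = frame_key(df)
        if key not in cache:
            cache[key] = func(df)
        return cache[key]

    return wrapper


calls = []  # records every real (uncached) computation


@memoize_frame
def column_means(df):
    calls.append(1)
    return df.mean()


df = pd.DataFrame(np.arange(12.0).reshape(3, 4))
first = column_means(df)
second = column_means(df.copy())  # equal contents, so this is a cache hit
```

Hashing the contents costs time proportional to the size of the frame on every call, so this only pays off when the memoized function is much more expensive than the hashing step. The cached result is also returned by reference, so callers should not mutate it.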
