Pandas equivalent of resample for an integer index

I am looking for the pandas equivalent of the resample method for a data frame whose index is not a DatetimeIndex but an array of integers, or perhaps even floats.

I know that in some cases (this one, for example) the resample method can easily be replaced by reindexing and interpolation, but in other cases (I think) it cannot.

For example, if I have

 df = pd.DataFrame(np.random.randn(10, 2))
 withdates = df.set_index(pd.date_range('2012-01-01', periods=10))
 # older pandas syntax; newer versions spell this withdates.resample('5D').std()
 withdates.resample('5D', np.std)

it gives me

                    0         1
 2012-01-01  1.184582  0.492113
 2012-01-06  0.533134  0.982562

but I can't get the same result with df and resample. So I'm looking for something that will work like

  df.resample(5, np.std) 

and it will give me

           0         1
 0  1.184582  0.492113
 5  0.533134  0.982562

Is there such a method? The only way I found was to manually split df into smaller data frames, apply np.std to each one, and then concatenate everything back, which I find pretty slow and not smart at all.

Greetings

3 answers

Setup

 import pandas as pd
 import numpy as np

 np.random.seed([3,1415])
 df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the grouping labels yourself. I would use:

 (df.index.to_series() / 5).astype(int) 

This gives a series of values such as [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...], which you then pass to groupby.

You will also need to specify an index for the new dataframe. I would use:

 df.index[4::5] 

This takes the current index starting at the 5th position (hence 4) and every fifth position after that, giving [4, 9, 14, 19]. I could have used df.index[::5] to get the starting positions instead, but I went with the ending positions.

Solution

 # assign as variable because I'm going to use it more than once.
 s = (df.index.to_series() / 5).astype(int)
 df.groupby(s).std().set_index(s.index[4::5])

It looks like:

            A         B
 4   0.198019  0.320451
 9   0.329750  0.408232
 14  0.293297  0.223991
 19  0.095633  0.376390
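To tie this back to the question, here is a minimal sketch that checks the grouping approach against a real datetime resample on the same data. It assumes a default integer index running 0..9 and labels each bin by its starting position rather than its ending position; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)))

# Integer-index version: label every row with the start of its 5-row bin.
labels = np.asarray(df.index) // 5 * 5      # 0,0,0,0,0,5,5,5,5,5
by_int = df.groupby(labels).std()

# Datetime version for comparison, using the method-chaining resample API.
withdates = df.set_index(pd.date_range("2012-01-01", periods=10))
by_date = withdates.resample("5D").std()

# Both group the same rows, so the aggregated values should agree.
print(np.allclose(by_int.to_numpy(), by_date.to_numpy()))
```

Floor division (`//`) is used here instead of dividing and casting with `astype(int)`; for non-negative integer positions the two produce the same bins.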

Other considerations

This is equivalent to downsampling. We have not considered upsampling yet.

To go from the downsampled dataframe back to a more frequent index, we can use reindex as follows:

 # assign what we've done above to df_down
 df_down = df.groupby(s).std().set_index(s.index[4::5])
 df_up = df_down.reindex(range(20)).bfill()

It looks like:

            A         B
 0   0.198019  0.320451
 1   0.198019  0.320451
 2   0.198019  0.320451
 3   0.198019  0.320451
 4   0.198019  0.320451
 5   0.329750  0.408232
 6   0.329750  0.408232
 7   0.329750  0.408232
 8   0.329750  0.408232
 9   0.329750  0.408232
 10  0.293297  0.223991
 11  0.293297  0.223991
 12  0.293297  0.223991
 13  0.293297  0.223991
 14  0.293297  0.223991
 15  0.095633  0.376390
 16  0.095633  0.376390
 17  0.095633  0.376390
 18  0.095633  0.376390
 19  0.095633  0.376390

We could also pass other things to reindex, for example range(0, 20, 2), to upsample to the even integer indices only.
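As a rough illustration of the upsampling step, the sketch below contrasts back-filling with interpolation after a reindex. The small `df_down` frame is made up for the example and stands in for a downsampled result whose index holds bin end positions.

```python
import pandas as pd

# Hypothetical downsampled frame, indexed by the end position of each bin.
df_down = pd.DataFrame({"A": [1.0, 3.0, 5.0]}, index=[4, 9, 14])

# bfill copies each bin value backwards over the rows it summarises.
stepwise = df_down.reindex(range(15)).bfill()

# Interpolating on the index draws a straight line between known rows;
# rows before the first known index are left as NaN.
linear = df_down.reindex(range(15)).interpolate(method="index")

print(stepwise.loc[[0, 4, 5], "A"].tolist())
print(linear.loc[[4, 5, 9], "A"].tolist())
```

Which of the two is appropriate depends on whether the aggregated value should be treated as constant over its bin or as a sample at one point.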


Alternatively, this is one thing you can do:

 def resample(df, rule, how=None, **kwargs):
     import pandas as pd
     if how is None:
         import numpy as np
         how = np.mean
     if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
         return df.resample(rule, how, **kwargs)
     else:
         idx, bins = pd.cut(df.index, range(df.index[0], df.index[-1]+2, rule), right=False, retbins=True)
         aux = df.groupby(idx).apply(how)
         aux = aux.set_index(bins[:-1])
         return aux
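For pandas versions where resample no longer accepts the aggregation as a second positional argument, a wrapper for the non-datetime case could look roughly like the sketch below. The name `resample_int` and its parameters are hypothetical, and it swaps `pd.cut` for plain floor division on the index, so treat it as one possible variant rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

def resample_int(df, step, how="mean"):
    """Hypothetical helper: aggregate rows into bins of width `step`.

    Each bin is labelled by its left edge, counted from the smallest
    index value. `how` is anything accepted by GroupBy.agg.
    """
    start = df.index.min()
    # Map every index value to the left edge of the bin it falls into.
    labels = (np.asarray(df.index) - start) // step * step + start
    return df.groupby(labels).agg(how)

df = pd.DataFrame({"A": range(10)}, index=range(10))
out = resample_int(df, 5, "sum")
print(out)
```

Because the labels come from arithmetic on the index values, the same function also accepts a float index, where bins with no rows simply do not appear in the result.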

The @piSquared solution is really nice, but I don't like picking the index by hand when reindexing.

The following should work for any kind of downsampling (float index too) and automatically uses the mean index value in each bin as the new index:

 df = pd.DataFrame(index=np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
 df.index.name = 'crazy_index'
 s = (df.index.to_series() / 10).astype(int)

Now you can choose the function that you want to calculate in each subgroup as you wish:

 # calculate mean() in each group
 df.groupby(s).mean().set_index(
     s.groupby(s).apply(lambda x: np.mean(x.index))
 )

                     A         B
 crazy_index
 3.667539     0.276986  0.317642
 14.275074    0.248700  0.372551
 25.054042    0.254860  0.297586

 # calculate median() in each group
 df.groupby(s).median().set_index(
     s.groupby(s).apply(lambda x: np.mean(x.index))
 )

                     A         B
 crazy_index
 3.667539     0.454654  0.521649
 14.275074    0.451265  0.490125
 25.054042    0.489326  0.622781

EDIT: There were some errors in indexing s, now this is working correctly.
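When the aggregation is the mean, the bin centre and the data can be computed in a single groupby by first moving the index into a column. This is only a sketch under that assumption; the bin width of 10 and the seed are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 2)), index=rng.random(20) * 30,
                  columns=["A", "B"])
df.index.name = "crazy_index"

# Bin label for every row: which width-10 interval the index value is in.
bins = np.floor(df.index.to_numpy() / 10).astype(int)

# reset_index turns crazy_index into an ordinary column, so mean() averages
# it together with A and B; set_index then moves it back into the index.
out = df.reset_index().groupby(bins).mean().set_index("crazy_index")
print(out)
```

For other aggregations such as median or std, the index would be aggregated with that same function, so the two-step version above remains the safer choice when the new index should always be the mean.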
