    import numpy as np
    import pandas as pd

    n = 10
    nrows = 33
    index = pd.date_range('2000-1-1', periods=nrows, freq='D')
    df = pd.DataFrame(np.ones(nrows), index=index)
    print(df)
Resampling df into n = 10 groups of equal duration gives
                         0
    2000-01-01 00:00:00  4
    2000-01-04 07:12:00  3
    2000-01-07 14:24:00  3
    2000-01-10 21:36:00  4
    2000-01-14 04:48:00  3
    2000-01-17 12:00:00  3
    2000-01-20 19:12:00  4
    2000-01-24 02:24:00  3
    2000-01-27 09:36:00  3
    2000-01-30 16:48:00  3
The values in column 0 are the number of rows aggregated into each group, since the original DataFrame was filled with the value 1. A pattern of 4s and 3s is about as even as you can get, because 33 rows cannot be divided evenly into 10 groups.
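A minimal sketch of one way to reproduce these counts on a recent pandas release. It makes two assumptions that are not in the text above: the group length is computed by dividing the covered time span directly by n, and the names start, end, and period are illustrative only. The aggregation is written as a method call on the resampler.

```python
import numpy as np
import pandas as pd

n = 10        # desired number of groups
nrows = 33
index = pd.date_range('2000-01-01', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)

# Time span covered by the data, counting the last day as a full day
# (assumes one row per day).
start = df.index.min()
end = df.index.max() + pd.Timedelta(days=1)

# Length of one group: 33 days / 10 = 3 days 07:12:00.
period = (end - start) / n

# resample accepts a Timedelta as the rule, so no frequency string is needed.
result = df.resample(period).sum()
counts = result[0].astype(int).tolist()
print(counts)  # [4, 3, 3, 4, 3, 3, 4, 3, 3, 3]
```

The group boundaries are 3 days 7 hours 12 minutes apart, which matches the timestamps in the output above.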
Explanation: Consider this simpler DataFrame:
    n = 2
    nrows = 5
    index = pd.date_range('2000-1-1', periods=nrows, freq='D')
    df = pd.DataFrame(np.ones(nrows), index=index)
Using df.resample('2D', how='sum') gives the wrong number of groups (three instead of the desired n = 2):
    In [366]: df.resample('2D', how='sum')
    Out[366]:
                0
    2000-01-01  2
    2000-01-03  2
    2000-01-05  1
Using df.resample('3D', how='sum') gives the correct number of groups, but the second group starts at 2000-01-04, so the two groups do not cover equal spans of time:
    In [367]: df.resample('3D', how='sum')
    Out[367]:
                0
    2000-01-01  3
    2000-01-04  2
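The how= keyword used in these transcripts is no longer accepted by recent pandas releases. As a hedged sketch, the same two experiments can be written with the aggregation called as a method on the resampler; the variable names two_day and three_day are illustrative only.

```python
import numpy as np
import pandas as pd

nrows = 5
index = pd.date_range('2000-01-01', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)

# Whole-day frequencies: neither choice splits 5 days into 2 equal halves.
two_day = df.resample('2D').sum()     # 3 groups with 2, 2, 1 rows
three_day = df.resample('3D').sum()   # 2 groups, but 3 rows vs 2 rows over unequal spans

print(two_day[0].tolist())    # [2.0, 2.0, 1.0]
print(three_day[0].tolist())  # [3.0, 2.0]
```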
To do better, we need a finer time resolution than whole days. Since Timedelta has a total_seconds method, let's work in seconds. For the example above, the desired frequency is
    In [374]: df.resample('216000S', how='sum')
    Out[374]:
                         0
    2000-01-01 00:00:00  3
    2000-01-03 12:00:00  2
since 5 days contain 216000 * 2 seconds:
    In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S'))/2
    Out[373]: 216000.0
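The arithmetic can be checked with a short sketch. It assumes a recent pandas release and passes the group length to resample as a Timedelta rather than as a frequency string, which is a substitution for the '216000S' rule shown above; halves is an illustrative name.

```python
import numpy as np
import pandas as pd

n = 2
nrows = 5
index = pd.date_range('2000-01-01', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)

# Half of the 5-day span, in whole seconds: 5 * 86400 / 2.
secs = int(pd.Timedelta(days=nrows).total_seconds() // n)
print(secs)  # 216000

# Each group now spans 2.5 days; the second one starts at 2000-01-03 12:00.
halves = df.resample(pd.Timedelta(seconds=secs)).sum()
print(halves[0].tolist())  # [3.0, 2.0]
```

The row counts are still 3 and 2, since five daily rows cannot be split evenly, but both groups now cover the same length of time.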
Ok, now we just need to generalize this. We will need the minimum and maximum dates in the index:
    first = df.index.min()
    last = df.index.max() + pd.Timedelta('1D')
We add an extra day so that the difference counts every day in the index. In the example above, there are only 4 days between the Timestamps for 2000-01-05 and 2000-01-01:
    In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
    Out[378]: 4
But, as we saw in the worked example, the DataFrame has 5 rows representing 5 days, so it makes sense to add the extra day.
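A small sketch of this off-by-one, assuming one row per day; naive_span and full_span are illustrative names rather than anything from the text above.

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=5, freq='D')

# Last timestamp minus first timestamp misses the final day.
naive_span = index.max() - index.min()

# Adding one day makes the span match the number of daily rows.
full_span = index.max() + pd.Timedelta(days=1) - index.min()

print(naive_span.days, full_span.days, len(index))  # 4 5 5
```

For data sampled at a different interval, the amount added would presumably be one sampling interval rather than one day.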
Now we can compute the number of seconds in each group from that difference:
secs = int((last-first).total_seconds()