How to reformat df with datetime index to exactly n uniform periods?

I have a large DataFrame with a datetime index, and I need to resample the data into exactly 10 uniform periods.

So far I have tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to get the size of each period, and then resampling with that number of days, e.g.:

first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last - first).days / 10) + 'D'
df.resample(periodsize, how='sum')

This does not guarantee exactly 10 periods in the df after resampling, since the period size is an int rounded down, and using a float does not work in resample. It feels like I am either missing something simple here or approaching the problem entirely wrong.
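For example (hypothetical numbers, just to illustrate the rounding problem, not my real data): a 33-day span divided by 10 floors to a 3-day period, and a 3-day period tiles 33 days into 11 bins rather than 10.

 # sketch of the rounding problem with made-up numbers
 import math
 total_days = 33
 n = 10
 period_days = total_days // n                 # floors to 3
 print(math.ceil(total_days / period_days))    # 11 bins, not 10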

python pandas
2 answers
 import numpy as np
 import pandas as pd

 n = 10
 nrows = 33
 index = pd.date_range('2000-1-1', periods=nrows, freq='D')
 df = pd.DataFrame(np.ones(nrows), index=index)
 print(df)
 #             0
 # 2000-01-01  1
 # 2000-01-02  1
 # ...
 # 2000-02-01  1
 # 2000-02-02  1

 first = df.index.min()
 last = df.index.max() + pd.Timedelta('1D')
 secs = int((last - first).total_seconds() // n)
 periodsize = '{:d}S'.format(secs)

 result = df.resample(periodsize, how='sum')
 print('\n{}'.format(result))
 assert len(result) == n

gives

                      0
 2000-01-01 00:00:00  4
 2000-01-04 07:12:00  3
 2000-01-07 14:24:00  3
 2000-01-10 21:36:00  4
 2000-01-14 04:48:00  3
 2000-01-17 12:00:00  3
 2000-01-20 19:12:00  4
 2000-01-24 02:24:00  3
 2000-01-27 09:36:00  3
 2000-01-30 16:48:00  3

The values in column 0 show how many rows were aggregated into each group, since the original DataFrame was filled with ones. Group sizes of 4 and 3 are as even as you can get here, because 33 rows cannot be divided evenly into 10 groups.
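As a small sanity check of that claim (a sketch with the same numbers, independent of the DataFrame above): splitting 33 rows into 10 groups can only produce group sizes of 3 and 4, the floor and ceiling of 33/10.

 # 33 rows into 10 groups: the most even split possible uses sizes 3 and 4
 nrows, n = 33, 10
 sizes = [nrows // n + (1 if i < nrows % n else 0) for i in range(n)]
 print(sizes)                                   # [4, 4, 4, 3, 3, 3, 3, 3, 3, 3]
 assert sum(sizes) == nrows and set(sizes) == {3, 4}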


Explanation: consider this simpler DataFrame:

 n = 2
 nrows = 5
 index = pd.date_range('2000-1-1', periods=nrows, freq='D')
 df = pd.DataFrame(np.ones(nrows), index=index)
 #             0
 # 2000-01-01  1
 # 2000-01-02  1
 # 2000-01-03  1
 # 2000-01-04  1
 # 2000-01-05  1

Using df.resample('2D', how='sum') gives the wrong number of groups

 In [366]: df.resample('2D', how='sum')
 Out[366]:
             0
 2000-01-01  2
 2000-01-03  2
 2000-01-05  1

Using df.resample('3D', how='sum') gives the correct number of groups, but the second group starts at 2000-01-04, so the DataFrame is not split into two equally sized groups:

 In [367]: df.resample('3D', how='sum')
 Out[367]:
             0
 2000-01-01  3
 2000-01-04  2

To do better, we need to work at a finer time resolution than days. Since Timedeltas have a total_seconds method, let's work in seconds. For the example above, the desired frequency would be

 In [374]: df.resample('216000S', how='sum')
 Out[374]:
                      0
 2000-01-01 00:00:00  3
 2000-01-03 12:00:00  2

since within 5 days there are 216000 * 2 seconds:

 In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1S')) / 2
 Out[373]: 216000.0

Ok, now we just need to generalize this. We will need the minimum and maximum dates in the index:

 first = df.index.min()
 last = df.index.max() + pd.Timedelta('1D')

We add an extra day because otherwise the difference in days undercounts the span. In the example above, there are only 4 days between the timestamps 2000-01-05 and 2000-01-01:

 In [377]: (pd.Timestamp('2000-01-05') - pd.Timestamp('2000-01-01')).days
 Out[377]: 4

But, as the worked example shows, the DataFrame has 5 rows representing 5 days, so it makes sense to add the extra day.

Now we can compute the correct number of seconds in each equal-length group:

 secs = int((last-first).total_seconds()//n) 
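For what it's worth, on newer pandas versions the how= keyword has been removed from resample, so the same idea would be written roughly as follows (a sketch, assuming a recent pandas where resample(...).sum() is the supported spelling):

 import numpy as np
 import pandas as pd

 n = 10
 nrows = 33
 index = pd.date_range('2000-1-1', periods=nrows, freq='D')
 df = pd.DataFrame(np.ones(nrows), index=index)

 first = df.index.min()
 last = df.index.max() + pd.Timedelta('1D')    # extra day so the span covers all rows
 secs = int((last - first).total_seconds() // n)

 # pass the bin width as a Timedelta and aggregate with .sum() instead of how='sum'
 result = df.resample(pd.Timedelta(seconds=secs)).sum()
 assert len(result) == n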

Here is one way to ensure equal-sized subperiods: use np.linspace() to compute evenly spaced cutoff points in terms of pd.Timedelta, and then assign each row to a bin with pd.cut.

 import pandas as pd
 import numpy as np

 # generate artificial data
 np.random.seed(0)
 df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'],
                   index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8H'))

 Out[87]:
                           A       B
 2015-01-01 00:00:00  1.7641  0.4002
 2015-01-01 08:00:00  0.9787  2.2409
 2015-01-01 16:00:00  1.8676 -0.9773
 2015-01-02 00:00:00  0.9501 -0.1514
 2015-01-02 08:00:00 -0.1032  0.4106
 2015-01-02 16:00:00  0.1440  1.4543
 2015-01-03 00:00:00  0.7610  0.1217
 2015-01-03 08:00:00  0.4439  0.3337
 2015-01-03 16:00:00  1.4941 -0.2052
 2015-01-04 00:00:00  0.3131 -0.8541
 2015-01-04 08:00:00 -2.5530  0.6536
 2015-01-04 16:00:00  0.8644 -0.7422
 2015-01-05 00:00:00  2.2698 -1.4544
 2015-01-05 08:00:00  0.0458 -0.1872
 2015-01-05 16:00:00  1.5328  1.4694
 ...                     ...     ...
 2015-01-29 08:00:00  0.9209  0.3187
 2015-01-29 16:00:00  0.8568 -0.6510
 2015-01-30 00:00:00 -1.0342  0.6816
 2015-01-30 08:00:00 -0.8034 -0.6895
 2015-01-30 16:00:00 -0.4555  0.0175
 2015-01-31 00:00:00 -0.3540 -1.3750
 2015-01-31 08:00:00 -0.6436 -2.2234
 2015-01-31 16:00:00  0.6252 -1.6021
 2015-02-01 00:00:00 -1.1044  0.0522
 2015-02-01 08:00:00 -0.7396  1.5430
 2015-02-01 16:00:00 -1.2929  0.2671
 2015-02-02 00:00:00 -0.0393 -1.1681
 2015-02-02 08:00:00  0.5233 -0.1715
 2015-02-02 16:00:00  0.7718  0.8235
 2015-02-03 00:00:00  2.1632  1.3365

 [100 rows x 2 columns]

 # cutoff points: 10 equal-size groups require 11 points,
 # measured as a timedelta in hours
 time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
 n = 10
 ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n + 1)

 # labels: the time index for each bin
 time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta) + 'h')
                                      for time_delta in ts_cutoff])

 # create a categorical reference variable
 df['start_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[:-1])
 # for clarity, also assign labels using the end-of-period index
 df['end_time_index'] = pd.cut(time_delta_in_hours, bins=10, labels=time_index[1:])

 Out[89]:
                           A       B     start_time_index       end_time_index
 2015-01-01 00:00:00  1.7641  0.4002  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-01 08:00:00  0.9787  2.2409  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-01 16:00:00  1.8676 -0.9773  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-02 00:00:00  0.9501 -0.1514  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-02 08:00:00 -0.1032  0.4106  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-02 16:00:00  0.1440  1.4543  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-03 00:00:00  0.7610  0.1217  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-03 08:00:00  0.4439  0.3337  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-03 16:00:00  1.4941 -0.2052  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-04 00:00:00  0.3131 -0.8541  2015-01-01 00:00:00  2015-01-04 07:12:00
 2015-01-04 08:00:00 -2.5530  0.6536  2015-01-04 07:12:00  2015-01-07 14:24:00
 2015-01-04 16:00:00  0.8644 -0.7422  2015-01-04 07:12:00  2015-01-07 14:24:00
 2015-01-05 00:00:00  2.2698 -1.4544  2015-01-04 07:12:00  2015-01-07 14:24:00
 2015-01-05 08:00:00  0.0458 -0.1872  2015-01-04 07:12:00  2015-01-07 14:24:00
 2015-01-05 16:00:00  1.5328  1.4694  2015-01-04 07:12:00  2015-01-07 14:24:00
 ...                     ...     ...                  ...                  ...
 2015-01-29 08:00:00  0.9209  0.3187  2015-01-27 09:36:00  2015-01-30 16:48:00
 2015-01-29 16:00:00  0.8568 -0.6510  2015-01-27 09:36:00  2015-01-30 16:48:00
 2015-01-30 00:00:00 -1.0342  0.6816  2015-01-27 09:36:00  2015-01-30 16:48:00
 2015-01-30 08:00:00 -0.8034 -0.6895  2015-01-27 09:36:00  2015-01-30 16:48:00
 2015-01-30 16:00:00 -0.4555  0.0175  2015-01-27 09:36:00  2015-01-30 16:48:00
 2015-01-31 00:00:00 -0.3540 -1.3750  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-01-31 08:00:00 -0.6436 -2.2234  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-01-31 16:00:00  0.6252 -1.6021  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-01 00:00:00 -1.1044  0.0522  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-01 08:00:00 -0.7396  1.5430  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-01 16:00:00 -1.2929  0.2671  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-02 00:00:00 -0.0393 -1.1681  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-02 08:00:00  0.5233 -0.1715  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-02 16:00:00  0.7718  0.8235  2015-01-30 16:48:00  2015-02-03 00:00:00
 2015-02-03 00:00:00  2.1632  1.3365  2015-01-30 16:48:00  2015-02-03 00:00:00

 [100 rows x 4 columns]

 df.groupby('start_time_index').agg('sum')
 Out[90]:
                           A       B
 start_time_index
 2015-01-01 00:00:00  8.6133  2.7734
 2015-01-04 07:12:00  1.9220 -0.8069
 2015-01-07 14:24:00 -8.1334  0.2318
 2015-01-10 21:36:00 -2.7572 -4.2862
 2015-01-14 04:48:00  1.1957  7.2285
 2015-01-17 12:00:00  3.2485  6.6841
 2015-01-20 19:12:00 -0.8903  2.2802
 2015-01-24 02:24:00 -2.1025  1.3800
 2015-01-27 09:36:00 -1.1017  1.3108
 2015-01-30 16:48:00 -0.0902 -2.5178

Another, potentially shorter way is to pass the sampling frequency as a time delta. The problem, as shown below, is that it produces 11 subsamples, not 10. I believe the reason is that resample uses a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) binning scheme, so the most recent observation at 2015-02-03 00:00:00 ends up in a separate group. If we do the binning ourselves with pd.cut, we can specify include_lowest=True so that it gives exactly 10 subsamples, not 11.

 n = 10
 time_delta_str = str((df.index[-1] - df.index[0]) / (pd.Timedelta('1s') * n)) + 's'
 df.resample(pd.Timedelta(time_delta_str), how='sum')

 Out[114]:
                           A       B
 2015-01-01 00:00:00  8.6133  2.7734
 2015-01-04 07:12:00  1.9220 -0.8069
 2015-01-07 14:24:00 -8.1334  0.2318
 2015-01-10 21:36:00 -2.7572 -4.2862
 2015-01-14 04:48:00  1.1957  7.2285
 2015-01-17 12:00:00  3.2485  6.6841
 2015-01-20 19:12:00 -0.8903  2.2802
 2015-01-24 02:24:00 -2.1025  1.3800
 2015-01-27 09:36:00 -1.1017  1.3108
 2015-01-30 16:48:00 -2.2534 -3.8543
 2015-02-03 00:00:00  2.1632  1.3365
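If you prefer the pd.cut route without building the hour-based cutoffs by hand, a shorter sketch (assuming a reasonably recent pandas, where pd.cut accepts a DatetimeIndex directly) is to cut the index itself into n equal-width bins, which keeps the final observation inside the tenth group:

 import numpy as np
 import pandas as pd

 np.random.seed(0)
 df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'],
                   index=pd.date_range('2015-01-01', periods=100, freq='8H'))

 n = 10
 bins = pd.cut(df.index, bins=n, include_lowest=True)  # 10 equal-width time intervals
 grouped = df.groupby(bins, observed=True).sum()
 assert len(grouped) == n                              # exactly 10 groups, not 11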
