Pythonic way to lag columns with date and time index

Question

I have dataframes with various kinds of datetime indexes (the data may be weekly, monthly, or annual). I want to create columns that hold lagged values of other columns. The data is imported from a spreadsheet; I do not generate the datetime index inside Python.

I am struggling to find a "pythonic" way to do this. I believe that if I use pandas' datetime functionality, the lags will be more reliable when the data is irregular or unusual.

I made a toy example that seems to work, but the same approach fails on my real data.

A toy example that works correctly (it creates a new column holding the previous month's value of "foo"):

    rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
    toy2 = pd.DataFrame(pd.Series(np.random.randint(0, 50, len(rng)), index=rng, name="foo"))

                foo
    2012-01-31    4
    2012-02-29    2
    2012-03-31   27
    2012-04-30    7
    2012-05-31   44
    2012-06-30   22
    2012-07-31   16
    2012-08-31   18
    2012-09-30   35
    2012-10-31   35
    2012-11-30   16
    2012-12-31   32

    toy2['lag_foo']= toy2['foo'].shift(1,'m')

                foo  lag_foo
    2012-01-31    4      NaN
    2012-02-29    2      4.0
    2012-03-31   27      2.0
    2012-04-30    7     27.0
    2012-05-31   44      7.0
    2012-06-30   22     44.0
    2012-07-31   16     22.0
    2012-08-31   18     16.0
    2012-09-30   35     18.0
    2012-10-31   35     35.0
    2012-11-30   16     35.0
    2012-12-31   32     16.0

But when I run this on my real data, it fails:

ValueError: cannot reindex from a duplicate axis

    print type(toy)
    print toy.columns
    print toy['IPE m2'][0:5]

    <class 'pandas.core.frame.DataFrame'>
    Index([u'IPE m2'], dtype='object')
    Date
    2016-04-30    43.29
    2016-03-31    40.44
    2016-02-29    34.17
    2016-01-31    32.47
    2015-12-31    39.35
    Name: IPE m2, dtype: float64

Exception Trace:

    ValueError                                Traceback (most recent call last)
    <ipython-input-170-9cb57a2ed681> in <module>()
    ----> 1 toy['prev_1m']= toy['IPE m2'].shift(1,'m')

    C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
       2355         else:
       2356             # set column
    -> 2357             self._set_item(key, value)
       2358
       2359     def _setitem_slice(self, key, value):

    C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
       2421
       2422         self._ensure_valid_index(value)
    -> 2423         value = self._sanitize_column(key, value)
       2424         NDFrame._set_item(self, key, value)
       2425

    C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
       2555
       2556         if isinstance(value, Series):
    -> 2557             value = reindexer(value)
       2558
       2559         elif isinstance(value, DataFrame):

    C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in reindexer(value)
       2547                 # duplicate axis
       2548                 if not value.index.is_unique:
    -> 2549                     raise e
       2550
       2551                 # other

    ValueError: cannot reindex from a duplicate axis

It seems I am missing some subtlety of pandas datetime indexes. I am also not sure this is the ideal way to do it. The only thing I could suspect is that the failing toy.index has None as its frequency, while the working toy2 example has its frequency set to "M".

    toy.index

    DatetimeIndex(['2016-04-30', '2016-03-31', '2016-02-29', '2016-01-31',
                   '2015-12-31', '2015-11-30', '2015-10-31', '2015-09-30',
                   '2015-08-31', '2015-07-31',
                   ...
                   'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT', 'NaT',
                   'NaT', 'NaT'],
                  dtype='datetime64[ns]', name=u'Date', length=142, freq=None)

    toy2.index

    DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
                   '2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31',
                   '2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31'],
                  dtype='datetime64[ns]', freq='M')
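A quick way to check whether an index like the one above can be safely realigned is to count its missing labels and test it for uniqueness. The sketch below uses a small made-up frame (the values and the name `frame` are illustrative, not the real data) and assumes Python 3 with a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical stand-in for the imported spreadsheet: three month-end
# dates followed by two empty rows, which parse as NaT.
idx = pd.to_datetime(["2016-04-30", "2016-03-31", "2016-02-29", None, None])
frame = pd.DataFrame(
    {"IPE m2": [43.29, 40.44, 34.17, float("nan"), float("nan")]},
    index=idx,
)

n_missing = int(frame.index.isna().sum())      # number of NaT labels
is_unique = frame.index.is_unique              # False if any label repeats
dupes = frame.index[frame.index.duplicated()]  # repeated labels after the first

print(n_missing, is_unique, frame.index.freq)
```

If `is_unique` is False, assigning a date-shifted Series back to the frame has to realign on a duplicated index, which is the situation the traceback reports.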

=============================

I dropped the NaT rows:

    toy = toy.dropna()
    toy['prev_1m']= toy['IPE m2'].shift(1,'m')

and I get the results that I wanted. However, I also get a warning:

    C:\Users\mds\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      if __name__ == '__main__':

====

This assignment method suppresses the warning:

    toy.loc[:,'prev_1m2']= toy['IPE m2'].shift(1,'m')
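The warning typically appears when pandas cannot tell whether the frame being modified is independent or derived from another one. One common way to make the intent explicit is to take a copy after filtering. This is only a sketch with made-up data and hypothetical names (`raw`, `clean`), and it uses the offset object `pd.offsets.MonthEnd()` in place of the `'m'` alias, which is a substitution of mine.

```python
import pandas as pd

raw = pd.DataFrame(
    {"IPE m2": [43.29, 40.44, 34.17]},
    index=pd.to_datetime(["2016-04-30", "2016-03-31", "2016-02-29"]),
)

# .copy() makes `clean` an independent frame, so adding a column to it
# is unambiguous and leaves `raw` untouched.
clean = raw.dropna().copy()

# Shift the date labels forward by one month end; the assignment then
# realigns the shifted values on the index of `clean`.
clean["prev_1m"] = clean["IPE m2"].shift(1, freq=pd.offsets.MonthEnd())
print(clean)
```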

python pandas datetime

user3556757 May 29 '16 at 6:08

1 answer

jezrael · Answer 1 · 2016-05-29T06:24:21+0000

The problem is that the index of the toy DataFrame contains many NaT values, so the index has duplicate values. (Some datetimes may be duplicated too.)

Example:

    import pandas as pd
    import numpy as np

    rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
    toy2 = pd.DataFrame(pd.Series(np.random.randint(0, 50, len(rng)), index=rng, name="foo"))

    df = pd.DataFrame({'foo': [10,30,19]}, index=[np.nan, np.nan, np.nan])
    print (df)
         foo
    NaN   10
    NaN   30
    NaN   19

    toy2 = pd.concat([toy2, df])
    print (toy2)
                foo
    2012-01-31   18
    2012-02-29   34
    2012-03-31   43
    2012-04-30   17
    2012-05-31   45
    2012-06-30    8
    2012-07-31   36
    2012-08-31   26
    2012-09-30    5
    2012-10-31   18
    2012-11-30   39
    2012-12-31    3
    NaT          10
    NaT          30
    NaT          19

    toy2['lag_foo']= toy2['foo'].shift(1,'m')
    print (toy2)

ValueError: cannot reindex from a duplicate axis

One possible solution is to omit the freq argument ('m'), so that the values are shifted by one row instead of by date:

    toy2['lag_foo']= toy2['foo'].shift(1)
    print (toy2)
                foo  lag_foo
    2012-01-31   21      NaN
    2012-02-29   13     21.0
    2012-03-31   41     13.0
    2012-04-30   38     41.0
    2012-05-31   15     38.0
    2012-06-30   41     15.0
    2012-07-31   30     41.0
    2012-08-31   18     30.0
    2012-09-30   12     18.0
    2012-10-31   35     12.0
    2012-11-30   23     35.0
    2012-12-31    7     23.0
    NaT          10      7.0
    NaT          30     10.0
    NaT          19     30.0
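Because `shift(1)` without a frequency moves values by one row, what counts as "previous" depends on the row order. The sketch below contrasts it with a date-based shift on made-up data sorted newest-first, like the frame in the question. The explicit offset `pd.offsets.MonthEnd()` stands in for the `'m'` alias; that substitution and the column names `by_row` and `by_date` are my assumptions.

```python
import pandas as pd

s = pd.Series(
    [43.29, 40.44, 34.17],
    index=pd.to_datetime(["2016-04-30", "2016-03-31", "2016-02-29"]),
    name="IPE m2",
)
out = s.to_frame()

# Positional: each row receives the value from the row above it, which in
# newest-first data is the *following* month.
out["by_row"] = s.shift(1)

# Date-based: labels move forward one month end, and the assignment
# realigns on the index, so each row receives the previous month's value.
out["by_date"] = s.shift(1, freq=pd.offsets.MonthEnd())

print(out)
```

For regularly spaced data with no gaps, sorting the index in ascending order first (`sort_index()`) makes the positional shift agree with the date-based one.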

If you need to drop all rows with NaN (NaT) in the index, use notnull with boolean indexing:

    print (toy2)
                foo
    2012-01-31   41
    2012-02-29   15
    2012-03-31    8
    2012-04-30    2
    2012-05-31   16
    2012-06-30   43
    2012-07-31    2
    2012-08-31   15
    2012-09-30    3
    2012-10-31   46
    2012-11-30   34
    2012-12-31   36
    NaT          10
    NaT          30
    NaT          19

    toy2 = toy2[pd.notnull(toy2.index)]

    toy2['lag_foo']= toy2['foo'].shift(1, 'm')
    print (toy2)
                foo  lag_foo
    2012-01-31   41      NaN
    2012-02-29   15     41.0
    2012-03-31    8     15.0
    2012-04-30    2      8.0
    2012-05-31   16      2.0
    2012-06-30   43     16.0
    2012-07-31    2     43.0
    2012-08-31   15      2.0
    2012-09-30    3     15.0
    2012-10-31   46      3.0
    2012-11-30   34     46.0
    2012-12-31   36     34.0
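The filtering and shifting steps can be combined into a small helper. This is a minimal sketch, not a definitive implementation: the function name `lag_by_month_end` is hypothetical, it assumes a month-end DatetimeIndex, and it passes the offset object `pd.offsets.MonthEnd()` rather than the `'m'` string alias.

```python
import pandas as pd

def lag_by_month_end(df, column, periods=1):
    """Drop NaT-labelled rows, then add a month-lagged copy of `column`.

    Hypothetical helper for illustration; assumes month-end dates.
    """
    cleaned = df[df.index.notna()].copy()   # drop rows whose label is NaT
    if not cleaned.index.is_unique:         # real dates may repeat as well
        raise ValueError("index still contains duplicate dates")
    # Move the labels forward by `periods` month ends; the assignment
    # below realigns the shifted values on the original dates.
    shifted = cleaned[column].shift(periods, freq=pd.offsets.MonthEnd())
    cleaned["lag_" + column] = shifted
    return cleaned

# Made-up data: three dated rows followed by two rows with a missing date.
idx = pd.to_datetime(["2012-01-31", "2012-02-29", "2012-03-31", None, None])
toy = pd.DataFrame({"foo": [41, 15, 8, 10, 30]}, index=idx)

result = lag_by_month_end(toy, "foo")
print(result)
```

Checking `is_unique` after removing the NaT rows covers the case mentioned above where some real datetimes are duplicated too.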
