Setting DataFrame Values with Enlargement

Question

I have two DataFrames (each with a DatetimeIndex) and want to update the first (older) frame with data from the second (newer) frame.

The new frame may contain more recent data for rows already present in the old frame. In that case, the data in the old frame should be overwritten with the data from the new frame. The new frame may also contain more columns or rows than the old one. In that case, the old frame should be enlarged with the data from the new frame.

The pandas docs state that

"The .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis"

and

"A DataFrame can be enlarged on either axis via .loc"

However, this does not seem to work and raises a KeyError. Example:

    In [195]: df1
    Out[195]:
                         A  B  C
    2015-07-09 12:00:00  1  1  1
    2015-07-09 13:00:00  1  1  1
    2015-07-09 14:00:00  1  1  1
    2015-07-09 15:00:00  1  1  1

    In [196]: df2
    Out[196]:
                         A  B  C  D
    2015-07-09 14:00:00  2  2  2  2
    2015-07-09 15:00:00  2  2  2  2
    2015-07-09 16:00:00  2  2  2  2
    2015-07-09 17:00:00  2  2  2  2

    In [197]: df1.loc[df2.index] = df2
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-197-74e630e87cf8> in <module>()
    ----> 1 df1.loc[df2.index] = df2

    /.../pandas/core/indexing.pyc in __setitem__(self, key, value)
        112
        113     def __setitem__(self, key, value):
    --> 114         indexer = self._get_setitem_indexer(key)
        115         self._setitem_with_indexer(indexer, value)
        116

    /.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key)
        107
        108         try:
    --> 109             return self._convert_to_indexer(key, is_setter=True)
        110         except TypeError:
        111             raise IndexingError(key)

    /.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
       1110                 mask = check == -1
       1111                 if mask.any():
    -> 1112                     raise KeyError('%s not in index' % objarr[mask])
       1113
       1114             return _values_from_object(indexer)

    KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index"
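For context, the enlargement described in the quoted docs applies when a single non-existent label is set, whereas a list-like of labels (such as df2.index) must already be present in the index. The following is a minimal sketch of that difference, assuming a recent pandas version; the frame and the names `hourly`, `missing`, and `list_enlarged` are illustrative and not part of the original session.

```python
import pandas as pd

hourly = pd.Timedelta(hours=1)
df = pd.DataFrame(
    1,
    index=pd.date_range("2015-07-09 12:00", periods=2, freq=hourly),
    columns=list("ABC"),
)

# Setting ONE label that does not exist yet enlarges the frame by a row.
df.loc[pd.Timestamp("2015-07-09 14:00")] = [2, 2, 2]

# Setting a LIST of labels that are not all in the index raises KeyError,
# which is what happens with df1.loc[df2.index] = df2.
missing = pd.date_range("2015-07-09 15:00", periods=2, freq=hourly)
try:
    df.loc[missing] = 3
    list_enlarged = True
except KeyError:
    list_enlarged = False

print(df.shape, list_enlarged)
```

Enlarging one label at a time in a loop would work but is slow on large data, which is why an alignment-based approach is usually a better fit here.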

What is the best way (in terms of performance, since my real data is much larger) to achieve the desired updated and enlarged DataFrame? This is the result I would like to see:

                         A  B  C    D
    2015-07-09 12:00:00  1  1  1  NaN
    2015-07-09 13:00:00  1  1  1  NaN
    2015-07-09 14:00:00  2  2  2    2
    2015-07-09 15:00:00  2  2  2    2
    2015-07-09 16:00:00  2  2  2    2
    2015-07-09 17:00:00  2  2  2    2

+7

python pandas

bmu Jul 9 '15 at 14:05

3 answers

You can use the combine function.

    import numpy as np
    import pandas as pd

    # your data
    # ===========================================================
    # note: freq='H' is spelled 'h' in recent pandas versions
    df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))
    df2 = pd.DataFrame(np.ones(16).reshape(4,4)*2, columns='A B C D'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))

    # processing
    # =====================================================
    # reindex to populate NaN
    result = df2.reindex(np.union1d(df1.index, df2.index))

    Out[248]:
                           A    B    C    D
    2015-07-09 12:00:00  NaN  NaN  NaN  NaN
    2015-07-09 13:00:00  NaN  NaN  NaN  NaN
    2015-07-09 14:00:00    2    2    2    2
    2015-07-09 15:00:00    2    2    2    2
    2015-07-09 16:00:00    2    2    2    2
    2015-07-09 17:00:00    2    2    2    2

    # take the value from df1 wherever result is NaN
    combiner = lambda x, y: np.where(x.isnull(), y, x)

    # use df1 to update result
    result.combine(df1, combiner)

    Out[249]:
                         A  B  C    D
    2015-07-09 12:00:00  1  1  1  NaN
    2015-07-09 13:00:00  1  1  1  NaN
    2015-07-09 14:00:00  2  2  2    2
    2015-07-09 15:00:00  2  2  2    2
    2015-07-09 16:00:00  2  2  2    2
    2015-07-09 17:00:00  2  2  2    2

    # maybe fillna(method='ffill') if you like

+6

Jianxun li Jul 9 '15 at 14:24

In addition to the previous answer, after reindexing you can use

    result.fillna(df1, inplace=True)

So, based on Jianxun Li's code (extended with another column), you can try this:

    import numpy as np
    import pandas as pd

    # your data
    # ===========================================================
    # note: freq='H' is spelled 'h' in recent pandas versions
    df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))
    df2 = pd.DataFrame(np.ones(20).reshape(4,5)*2, columns='A B C D E'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))

    # processing
    # =====================================================
    # reindex to populate NaN
    result = df2.reindex(np.union1d(df1.index, df2.index))

    # fill NaN from df1
    result.fillna(df1, inplace=True)
    result

    Out[3]:
                         A  B  C    D    E
    2015-07-09 12:00:00  1  1  1  NaN  NaN
    2015-07-09 13:00:00  1  1  1  NaN  NaN
    2015-07-09 14:00:00  2  2  2    2    2
    2015-07-09 15:00:00  2  2  2    2    2
    2015-07-09 16:00:00  2  2  2    2    2
    2015-07-09 17:00:00  2  2  2    2    2
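As a sanity check on this approach, here is a small sketch comparing the reindex-then-fillna route with DataFrame.combine_first on the same data. It makes two substitutions that are not in the answer above: `Index.union` stands in for `np.union1d`, and the hourly frequency is passed as a `pd.Timedelta` so the snippet does not depend on the 'H'/'h' alias spelling. The names `filled`, `combined`, and `same` are illustrative.

```python
import numpy as np
import pandas as pd

hourly = pd.Timedelta(hours=1)
df1 = pd.DataFrame(
    np.ones((4, 3)),
    columns=list("ABC"),
    index=pd.date_range("2015-07-09 12:00", periods=4, freq=hourly),
)
df2 = pd.DataFrame(
    np.full((4, 5), 2.0),
    columns=list("ABCDE"),
    index=pd.date_range("2015-07-09 14:00", periods=4, freq=hourly),
)

# Enlarge df2 to the union of both row indexes, then fill the gaps from df1.
# fillna with a DataFrame aligns on both index and columns.
filled = df2.reindex(df1.index.union(df2.index)).fillna(df1)

# combine_first performs the alignment on both axes by itself.
combined = df2.combine_first(df1)

same = filled.equals(combined)
print(same)
```

Note that the reindex here only extends the rows. If the old frame had columns that the new frame lacks, they would not appear in `filled`, so this route assumes the new frame's columns are a superset of the old frame's.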

+3

herbico Apr 9 '16 at 23:19

Joshua baboo · Accepted Answer · 2016-04-11T17:37:05+0000

df2.combine_first(df1) (documentation) seems to meet your requirement. Please find below a code snippet and its output.

    import pandas as pd

    print('pandas-version: ', pd.__version__)

    # old frame: columns A, B, C
    df1 = pd.DataFrame.from_records(
        [('2015-07-09 12:00:00', 1, 1, 1),
         ('2015-07-09 13:00:00', 1, 1, 1),
         ('2015-07-09 14:00:00', 1, 1, 1),
         ('2015-07-09 15:00:00', 1, 1, 1)],
        columns=['Dt', 'A', 'B', 'C']).set_index('Dt')
    # print(df1)

    # new frame: overlapping rows plus newer rows and an extra column D
    df2 = pd.DataFrame.from_records(
        [('2015-07-09 14:00:00', 2, 2, 2, 2),
         ('2015-07-09 15:00:00', 2, 2, 2, 2),
         ('2015-07-09 16:00:00', 2, 2, 2, 2),
         ('2015-07-09 17:00:00', 2, 2, 2, 2)],
        columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt')

    # values from df2 take priority; gaps are filled from df1
    res_combine1st = df2.combine_first(df1)
    print(res_combine1st)

Output

    pandas-version:  0.15.2
                         A  B  C    D
    Dt
    2015-07-09 12:00:00  1  1  1  NaN
    2015-07-09 13:00:00  1  1  1  NaN
    2015-07-09 14:00:00  2  2  2    2
    2015-07-09 15:00:00  2  2  2    2
    2015-07-09 16:00:00  2  2  2    2
    2015-07-09 17:00:00  2  2  2    2
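One caveat worth keeping in mind: combine_first only falls back to the other frame where the calling frame is null, so a NaN in the newer frame does not overwrite an existing value in the older one. The sketch below illustrates this on made-up data, and shows one possible alternative (concat plus dropping duplicated index labels) for the case where rows of the newer frame should win outright. That alternative is not from the answer above, and the names `old`, `new`, `patched`, and `overwritten` are hypothetical.

```python
import numpy as np
import pandas as pd

idx_old = pd.to_datetime(["2015-07-09 14:00", "2015-07-09 15:00"])
idx_new = pd.to_datetime(["2015-07-09 15:00", "2015-07-09 16:00"])

old = pd.DataFrame({"A": [1.0, 1.0]}, index=idx_old)
new = pd.DataFrame({"A": [np.nan, 2.0], "D": [2.0, 2.0]}, index=idx_new)

# combine_first prefers `new`, but falls back to `old` wherever `new` is NaN,
# so the old value at 15:00 in column A survives.
patched = new.combine_first(old)

# Possible alternative when whole rows of `new` must win, NaN included:
# stack both frames and keep the last occurrence of each timestamp.
stacked = pd.concat([old, new])
overwritten = stacked[~stacked.index.duplicated(keep="last")].sort_index()

print(patched)
print(overwritten)
```

Which behavior is right depends on whether a NaN in the newer data means "no update" or "value is now missing".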
