Setting values with pandas.DataFrame

Given this DataFrame:

    import pandas

    dates = pandas.date_range('2016-01-01', periods=5, freq='H')
    s = pandas.Series([0, 1, 2, 3, 4], index=dates)
    df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
    df.set_index(['a', 'b'], inplace=True)
    df


I would like to replace that Series with a new one that is simply the old one resampled to daily frequency (i.e. x.resample('D').sum().dropna()).
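For reference, a minimal sketch of what that daily resampling does on its own, using the hourly Series from the setup above (all five hourly values fall on the same day, so they collapse to a single row):

```python
import pandas as pd

# Hourly Series, as in the question's setup
dates = pd.date_range('2016-01-01', periods=5, freq='H')
x = pd.Series([0, 1, 2, 3, 4], index=dates)

# Resample to daily frequency, sum the values, drop empty days
daily = x.resample('D').sum().dropna()
print(daily)  # single row: 2016-01-01 -> 10
```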

When I try:

 df['foo'][0] = df['foo'][0].resample('D').sum().dropna() 

This seems to work.


However, I get a warning:

    SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The question is: how do I do this properly, without the warning?

Notes

Things I tried that do not work (with or without resampling on the right-hand side, the assignment raises an exception):

    df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
    df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
    df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']

A bit more information about the data (in case it matters):

  • The real DataFrame has more columns in the multi-index. Not all of them are necessarily integers, but they are generally numeric or categorical. The index is unique (i.e. there is only one row with a given index value).
  • The real DataFrame has, of course, many more rows in it (thousands).
  • The DataFrame does not have to have only two columns, and there can be more than one column containing Series. Columns usually contain series, categorical data, and numeric data. Any single column is always of the same type (numeric, categorical, or Series).
  • The Series contained in each cell usually has a variable length (i.e. two Series/cells in the DataFrame do not, except by pure coincidence, have the same length, and will probably never have the same index anyway, since the dates vary between series).

Using Python 3.5.1 and Pandas 0.18.1.

python pandas
3 answers

This should work:

 df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna() 

Pandas complains about the chained indexing, but when you don't chain, it has trouble assigning a whole Series to a single cell. With iat you can force an assignment like this. I don't think it is preferable, but it seems to be a working solution.
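Put together end to end, a sketch reusing the question's setup (df['foo'].iloc[0] is used instead of df['foo'][0] to keep the read side unambiguous):

```python
import pandas as pd

# Rebuild the question's DataFrame: one row, with a Series stored in 'foo'
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
df = pd.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)

# One direct (non-chained) assignment into the cell: no SettingWithCopyWarning
new_series = df['foo'].iloc[0].resample('D').sum().dropna()
df.iat[0, df.columns.get_loc('foo')] = new_series

print(df['foo'].iloc[0])  # the cell now holds the daily-resampled Series
```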


Hierarchical data in pandas

It seems you should consider restructuring your data to take advantage of pandas features such as a MultiIndex and a DatetimeIndex. This lets you work with the index in the usual way while still being able to select columns by the hierarchical data (a, b and bar).

Restructured data

    import pandas as pd

    # Define the index
    dates = pd.date_range('2016-01-01', periods=5, freq='H')
    # Define the Series
    s = pd.Series([0, 1, 2, 3, 4], index=dates)
    # Place the Series in a hierarchical DataFrame
    heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
    df = pd.DataFrame(s.values, index=dates, columns=heirIndex)
    print(df)

    a                    1
    b                    2
    bar                  8
    2016-01-01 00:00:00  0
    2016-01-01 01:00:00  1
    2016-01-01 02:00:00  2
    2016-01-01 03:00:00  3
    2016-01-01 04:00:00  4
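With the hierarchy in the columns, label-based selection also becomes easy; for example, selecting on the bar level (a sketch, rebuilding the same single-column DataFrame):

```python
import pandas as pd

dates = pd.date_range('2016-01-01', periods=5, freq='H')
cols = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame([0, 1, 2, 3, 4], index=dates, columns=cols)

# Select all columns where the 'bar' level equals 8
sub = df.xs(8, axis=1, level='bar')
print(sub)  # same data, with the 'bar' level dropped from the columns
```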

Resampling

With the data in this format, resampling becomes very simple.

    # Simple direct resampling
    df_resampled = df.resample('D').sum().dropna()
    print(df_resampled)

    a            1
    b            2
    bar          8
    2016-01-01  10

Update (from data description)

If the data consists of variable-length Series, each with a different index, along with non-numeric categories, that is fine too. Here is an example:

    import pandas as pd

    # Define the first Series
    dates = pd.date_range('2016-01-01', periods=5, freq='H')
    s = pd.Series([0, 1, 2, 3, 4], index=dates)
    # Define the second Series
    dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
    s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
    # Define the DataFrames
    df1 = pd.DataFrame(s.values, index=dates,
                       columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']],
                                                         names=['a', 'b', 'bar', 'c']))
    df2 = pd.DataFrame(s2.values, index=dates2,
                       columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']],
                                                         names=['a', 'b', 'bar', 'c']))
    df = pd.concat([df1, df2])
    print(df)

    a                       1      2
    b                       2      5
    bar                     8      5
    c                    cat1   cat3
    2016-01-01 00:00:00   0.0    NaN
    2016-01-01 01:00:00   1.0    NaN
    2016-01-01 02:00:00   2.0    NaN
    2016-01-01 03:00:00   3.0    NaN
    2016-01-01 04:00:00   4.0    NaN
    2016-01-14 00:00:00   NaN -200.0
    2016-01-14 01:00:00   NaN   10.0
    2016-01-14 02:00:00   NaN   24.0
    2016-01-14 03:00:00   NaN   30.0
    2016-01-14 04:00:00   NaN   40.0
    2016-01-14 05:00:00   NaN  100.0

The only wrinkle comes after resampling: you will want to pass how='all' when dropping NA rows, like this:

    # Simple direct resampling
    df_resampled = df.resample('D').sum().dropna(how='all')
    print(df_resampled)

    a              1      2
    b              2      5
    bar            8      5
    c           cat1   cat3
    2016-01-01  10.0    NaN
    2016-01-14   NaN    4.0

Just set df.is_copy = False before assigning the new value. Note that this only suppresses the SettingWithCopyWarning rather than changing what the assignment does, and the is_copy attribute was deprecated in later pandas versions.

