Why does mean fail for a pandas Series of DataFrames when sum works, and how can I make it work?

There may be a more sensible way to do this in pandas, but the following example seems like it should work, yet it does not:

 import pandas as pd
 import numpy as np

 df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=['a', 'b'])
 df2 = df1.copy()
 df3 = df1.copy()

 idx = pd.date_range("2010-01-01", freq='H', periods=3)
 s = pd.Series([df1, df2, df3], index=idx)

 # This causes an error
 s.mean()

I will not post the whole traceback, but the main error message is interesting:

 TypeError: Could not convert    a   b
 0  6  12
 1  0   6
 2  6  10 to numeric

It looks like the DataFrames were summed successfully, but the result was not divided by the length of the series.

However, we can take the sum of the series:

 s.sum() 

Returns

    a   b
 0  6  12
 1  0   6
 2  6  10

Why does mean not work when sum does? Is this a bug or a missing feature? After all, this works:

 (df1 + df2 + df3)/3.0 

And so does this:

 s.sum()/3.0

    a         b
 0  2  4.000000
 1  0  2.000000
 2  2  3.333333

But this, of course, is not ideal.
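A slightly more general version of that workaround, which avoids hard-coding the divisor, is to divide the sum by the length of the series. A minimal sketch, assuming every element of s is a DataFrame of the same shape:

 # Compute the mean manually: element-wise sum of the DataFrames
 # divided by how many there are (avoids hard-coding 3.0).
 manual_mean = s.sum() / len(s)
 print(manual_mean)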

python pandas
2 answers

When you define s with

 s = pd.Series([df1, df2, df3], index=idx) 

you get a series with DataFrames as elements:

 In [77]: s
 Out[77]:
 2010-01-01 00:00:00       a  b
 0  1  0
 1  1  2
 2  2  0
 2010-01-01 01:00:00       a  b
 0  1  0
 1  1  2
 2  2  0
 2010-01-01 02:00:00       a  b
 0  1  0
 1  1  2
 2  2  0
 Freq: H, dtype: object

The sum of the elements is a DataFrame:

 In [78]: s.sum()
 Out[78]:
    a  b
 0  3  0
 1  3  6
 2  6  0

but when you take the mean, nanops.nanmean is called:

 def nanmean(values, axis=None, skipna=True):
     values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
     the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
     ...

Note that _ensure_numeric (from the pandas source) is called on the sum. The error occurs because a DataFrame is not numeric.
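As a rough illustration of that code path (not the exact internal code, which lives in pandas/core/nanops.py): summing the object array succeeds because DataFrame addition is well defined, but coercing the resulting DataFrame to a single number fails:

 # The element-wise sum works: adding DataFrames is well defined.
 total = s.sum()
 # But a DataFrame cannot be coerced to a scalar, which is roughly
 # what _ensure_numeric attempts, so mean() blows up.
 float(total)  # raises TypeError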

Here is a workaround. Instead of creating a series with DataFrames as elements, you can combine the DataFrames into a new DataFrame with a hierarchical index:

 In [79]: s = pd.concat([df1, df2, df3], keys=idx)

 In [80]: s
 Out[80]:
                        a  b
 2010-01-01 00:00:00 0  1  0
                     1  1  2
                     2  2  0
 2010-01-01 01:00:00 0  1  0
                     1  1  2
                     2  2  0
 2010-01-01 02:00:00 0  1  0
                     1  1  2
                     2  2  0

Now you can take the sum and mean:

 In [82]: s.sum(level=1)
 Out[82]:
    a  b
 0  3  0
 1  3  6
 2  6  0

 In [84]: s.mean(level=1)
 Out[84]:
    a  b
 0  1  0
 1  1  2
 2  2  0
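A side note for newer pandas versions: the level keyword of sum and mean was later deprecated and eventually removed, so on a recent install the equivalent spelling is a groupby on the index level:

 # Equivalent to s.sum(level=1) / s.mean(level=1) in modern pandas:
 # group on the second level of the hierarchical index.
 s.groupby(level=1).sum()
 s.groupby(level=1).mean()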

You can (as @unutbu suggested) use a hierarchical index, but when you have a three-dimensional array you should consider using a pandas Panel, especially when one of the dimensions is time, as in this case.

The Panel is often overlooked, but it is, after all, where the name pandas comes from (panel data, or something like that).

The data here is slightly different from your original, so that no two dimensions have the same length:

 df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0], [2, 3]], columns=['a', 'b'])
 df2 = df1 + 1
 df3 = df1 + 10

Panels can be created in several ways, but here is one: you can build one from a dict mapping your index to the DataFrames:

 s = pd.Panel(dict(zip(idx, [df1, df2, df3])))

The mean you are looking for is just a matter of operating along the right axis (in this case axis=0):

 s.mean(axis=0)
 Out[80]:
           a         b
 0  4.666667  3.666667
 1  4.666667  5.666667
 2  5.666667  3.666667
 3  5.666667  6.666667

With your data, sum(axis=0) returns the expected result.
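For what it is worth, Panel was removed entirely in pandas 0.25, so on a modern install the same "square data" mean can be computed by stacking the values in NumPy. A minimal sketch, assuming all the DataFrames share the same shape, index, and columns:

 # Stack the equally-shaped DataFrames into one 3-D array and
 # average over the first axis (the "time" dimension).
 stacked = np.stack([df.values for df in (df1, df2, df3)])
 mean_df = pd.DataFrame(stacked.mean(axis=0),
                        index=df1.index, columns=df1.columns)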

EDIT: OK, I am too late with the Panel suggestion, since the hierarchical-index approach has already been accepted. I will say that that approach is preferable if the data is "ragged", with a different (and possibly unknown) number of rows in each group. For square data, the Panel is absolutely the way to go and will be significantly faster, with more built-in operations. pandas 0.15 has many improvements for multi-level indexing, but it still has limitations and dark corners in real applications.

