Insert 0 values for missing dates in MultiIndex

Question

Suppose I have a MultiIndex that consists of a date and some categories (one, for simplicity, in the example below), and for each category I have a time series with the values of some process. I only have a value when an observation was made, and now I want to add a 0 for every date on which there was no observation. I found a way that seems very inefficient (unstacking and stacking, which creates many columns when there are millions of categories).

    import datetime as dt
    import pandas as pd

    days = 4
    # List of all dates that should be in the index
    all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
    df = pd.DataFrame([
        (dt.date(2013, 2, 10), 1, 4),
        (dt.date(2013, 2, 10), 2, 7),
        (dt.date(2013, 2, 11), 2, 7),
        (dt.date(2013, 2, 13), 1, 2),
        (dt.date(2013, 2, 13), 2, 3)],
        columns=['date', 'category', 'value'])
    df.set_index(['date', 'category'], inplace=True)
    print(df)
    # Insert 0 values for missing dates
    print(df.unstack().reindex(all_dates).fillna(0).stack())
    print(all_dates)

                         value
    date       category
    2013-02-10 1             4
               2             7
    2013-02-11 2             7
    2013-02-13 1             2
               2             3
                         value
               category
    2013-02-13 1             2
               2             3
    2013-02-12 1             0
               2             0
    2013-02-11 1             0
               2             7
    2013-02-10 1             4
               2             7
    [datetime.date(2013, 2, 13), datetime.date(2013, 2, 12), datetime.date(2013, 2, 11), datetime.date(2013, 2, 10)]

Does anyone know a smarter way to achieve the same?

EDIT: I found another way to achieve the same:

    import datetime as dt
    import pandas as pd

    days = 4
    # List of all dates that should be in the index
    all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
    df = pd.DataFrame([
        (dt.date(2013, 2, 10), 1, 4, 5),
        (dt.date(2013, 2, 10), 2, 1, 7),
        (dt.date(2013, 2, 10), 2, 2, 7),
        (dt.date(2013, 2, 11), 2, 3, 7),
        (dt.date(2013, 2, 13), 1, 4, 2),
        (dt.date(2013, 2, 13), 2, 4, 3)],
        columns=['date', 'category', 'cat2', 'value'])

    date_col = 'date'
    other_index = ['category', 'cat2']
    index = [date_col] + other_index
    df.set_index(index, inplace=True)

    # Reindex each (category, cat2) group against the full list of dates
    grouped = df.groupby(level=other_index)
    df_list = []
    for i, group in grouped:
        df_list.append(group.reset_index(level=other_index).reindex(all_dates).fillna(0))

    print(pd.concat(df_list).set_index(other_index, append=True))

                              value
               category cat2
    2013-02-13 1        4         2
    2013-02-12 0        0         0
    2013-02-11 0        0         0
    2013-02-10 1        4         5
    2013-02-13 0        0         0
    2013-02-12 0        0         0
    2013-02-11 0        0         0
    2013-02-10 2        1         7
    2013-02-13 0        0         0
    2013-02-12 0        0         0
    2013-02-11 0        0         0
    2013-02-10 2        2         7
    2013-02-13 0        0         0
    2013-02-12 0        0         0
    2013-02-11 2        3         7
    2013-02-10 0        0         0
    2013-02-13 2        4         3
    2013-02-12 0        0         0
    2013-02-11 0        0         0
    2013-02-10 0        0         0

pandas

Arthur g Feb 13 '13 at 15:26

2 answers

Christian long · Answer 1 · 2016-12-22T03:05:49+0000

You can create a new multi-index based on the Cartesian product of the required index levels. Then reindex your data frame with the new index.

    (date_index, category_index) = df.index.levels
    new_index = pd.MultiIndex.from_product([all_dates, category_index])
    new_df = df.reindex(new_index)

    # Optional: convert missing values to zero, and convert the data back
    # to integers. See explanation below.
    new_df = new_df.fillna(0).astype(int)

That's it! The new data frame has all the possible index values, and the existing data is indexed correctly.

Read on for a more detailed explanation.

Description

Sample Data Setup

    import datetime as dt
    import pandas as pd

    days = 4
    # List of all dates that should be in the index
    all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
    df = pd.DataFrame([
        (dt.date(2013, 2, 10), 1, 4),
        (dt.date(2013, 2, 10), 2, 7),
        (dt.date(2013, 2, 11), 2, 7),
        (dt.date(2013, 2, 13), 1, 2),
        (dt.date(2013, 2, 13), 2, 3)],
        columns=['date', 'category', 'value'])
    df.set_index(['date', 'category'], inplace=True)

Here is what the sample data looks like:

                         value
    date       category
    2013-02-10 1             4
               2             7
    2013-02-11 2             7
    2013-02-13 1             2
               2             3

Create a new index

Using from_product, we can create a new multi-index. This new index is the Cartesian product of the lists of values that you pass to the function.

    (date_index, category_index) = df.index.levels
    new_index = pd.MultiIndex.from_product([all_dates, category_index])
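A minimal sketch of the same step with level names carried over. It assumes pandas Timestamps instead of datetime.date objects (a choice made here for brevity, not something the answer uses), and the sample frame is rebuilt so the block runs on its own. Because all_dates is a plain list, from_product cannot infer a name for the date level, so names= is passed explicitly.

```python
import pandas as pd

# Assumption: Timestamps via date_range stand in for the datetime.date list above.
all_dates = pd.date_range("2013-02-10", periods=4, freq="D")

# Same five observations as the sample data.
df = pd.DataFrame(
    {
        "date": all_dates[[0, 0, 1, 3, 3]],
        "category": [1, 2, 2, 1, 2],
        "value": [4, 7, 7, 2, 3],
    }
).set_index(["date", "category"])

category_index = df.index.levels[1]

# Cartesian product of dates and categories, keeping the original level names.
new_index = pd.MultiIndex.from_product(
    [all_dates, category_index], names=df.index.names
)

print(len(new_index))         # 8 (4 dates x 2 categories)
print(list(new_index.names))  # ['date', 'category']
```

Note that df.index.levels can still list values that no longer occur in the data after filtering; calling df.index.remove_unused_levels() first is one way to avoid that.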

Reindex

Use the new index to reindex the existing data frame.

All possible combinations are now present. The missing values are null (NaN).

    new_df = df.reindex(new_index)

The expanded, reindexed data frame now looks like this:

                  value
    2013-02-13 1    2.0
               2    3.0
    2013-02-12 1    NaN
               2    NaN
    2013-02-11 1    NaN
               2    7.0
    2013-02-10 1    4.0
               2    7.0
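If zeros are what you want in the end, reindex also accepts a fill_value argument, so the NaN step can be skipped. The sketch below is an aside rather than part of the answer; it again assumes Timestamps instead of datetime.date objects and rebuilds the sample frame so that it is self-contained.

```python
import pandas as pd

# Assumption: Timestamps via date_range stand in for the datetime.date list above.
all_dates = pd.date_range("2013-02-10", periods=4, freq="D")

df = pd.DataFrame(
    {
        "date": all_dates[[0, 0, 1, 3, 3]],
        "category": [1, 2, 2, 1, 2],
        "value": [4, 7, 7, 2, 3],
    }
).set_index(["date", "category"])

new_index = pd.MultiIndex.from_product(
    [all_dates, df.index.levels[1]], names=df.index.names
)

# Missing (date, category) pairs are filled with 0 directly,
# so no NaN is introduced along the way.
filled = df.reindex(new_index, fill_value=0)

print(filled)
print(filled["value"].dtype)
```

Since no NaN is ever inserted, the value column should keep its integer dtype, which makes a later astype(int) unnecessary.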

Nulls in an integer column

You can see that the data in the new data frame has been converted from ints to floats. A regular integer column in pandas cannot hold nulls (NaN), so the column is upcast to float when they are introduced. If desired, we can convert all the nulls to 0 and cast the data back to integers.

    new_df = new_df.fillna(0).astype(int)

Result

                  value
    2013-02-13 1      2
               2      3
    2013-02-12 1      0
               2      0
    2013-02-11 1      0
               2      7
    2013-02-10 1      4
               2      7
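If the gaps should stay visibly missing instead of becoming 0, newer pandas versions offer a nullable integer dtype, "Int64" (capital I). This is an addition to the answer, sketched under the assumption that such a pandas version is installed and, as a simplification, that Timestamps are used instead of datetime.date objects.

```python
import pandas as pd

# Assumption: Timestamps via date_range stand in for the datetime.date list above.
all_dates = pd.date_range("2013-02-10", periods=4, freq="D")

df = pd.DataFrame(
    {
        "date": all_dates[[0, 0, 1, 3, 3]],
        "category": [1, 2, 2, 1, 2],
        "value": [4, 7, 7, 2, 3],
    }
).set_index(["date", "category"])

new_index = pd.MultiIndex.from_product(
    [all_dates, df.index.levels[1]], names=df.index.names
)

# Plain reindex introduces NaN and upcasts the column to float.
new_df = df.reindex(new_index)

# Nullable integers: observed values stay integers, gaps become <NA>.
nullable = new_df.astype("Int64")

print(nullable["value"].dtype)         # Int64
print(nullable["value"].isna().sum())  # 3 missing (date, category) pairs
```

Calling nullable.fillna(0) afterwards still gives the zero-filled result, so the two approaches can be combined.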

zach · Answer 2 · 2013-02-13T20:05:21+0000

Check out this answer: How to fill in the missing records of a Pandas dataframe in a Pythonic way?

You can do something like:

    import datetime
    import pandas as pd

    # Make an empty dataframe with the index you want
    def get_datetime(x):
        return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)

    all_dates = [get_datetime(x) for x in range(4)]
    categories = [1, 2, 3, 4]
    index = [[date, cat] for cat in categories for date in all_dates]

    # This df will be just an index
    df = pd.DataFrame(index, columns=['date', 'category'])
    df = df.set_index(['date', 'category'])

    # Now, if your original df is called df_orig, you can reindex it against the new index
    # (reindex_axis is no longer available in current pandas; reindex does the same job here)
    df_orig = df_orig.reindex(df.index)

    # And to add zeros
    df_orig = df_orig.fillna(0)
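Both answers use a single category level, while the edit in the question has two (category and cat2) and only fills dates for the pairs that actually occur. One possible way to extend the reindex idea to that case is sketched below. The variable names combos and result are made up for this illustration, Timestamps are assumed instead of datetime.date objects, and the new index is built from the observed pairs only, so the number of rows grows with the observed combinations instead of the full product of all levels.

```python
import pandas as pd

# Assumption: Timestamps via date_range stand in for the datetime.date list.
all_dates = pd.date_range("2013-02-10", periods=4, freq="D")

# The six observations from the question's edit.
df = pd.DataFrame(
    {
        "date": all_dates[[0, 0, 0, 1, 3, 3]],
        "category": [1, 2, 2, 2, 1, 2],
        "cat2": [4, 1, 2, 3, 4, 4],
        "value": [5, 7, 7, 7, 2, 3],
    }
).set_index(["date", "category", "cat2"])

# Only the (category, cat2) pairs that occur in the data.
combos = df.index.droplevel("date").unique()

# Every date for every observed pair.
new_index = pd.MultiIndex.from_tuples(
    [(date, *combo) for combo in combos for date in all_dates],
    names=df.index.names,
)

# Missing rows get value 0; the index levels themselves stay intact.
result = df.reindex(new_index, fill_value=0)

print(result.shape)  # (20, 1): 5 observed pairs x 4 dates
```

Unlike the groupby loop in the question, the category and cat2 labels of the filled rows are preserved here instead of being replaced by 0.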
