Pandas: combine data files, forward fill, and multi-index by column data

I have 2 csv files with the same column names but different values.

The first column is the index (time), and one of the data columns is a unique identifier (id).

The index (time) values are different for each csv file.

I read the data into two data frames using read_csv, which gives me the following:

          +-----+------+-------+
          | id  | size | price |
  +-------+-----+------+-------+
  | time  |     |      |       |
  +-------+-----+------+-------+
  | t0    | ID1 | 10   | 110   |
  | t2    | ID1 | 12   | 109   |
  | t6    | ID1 | 20   | 108   |
  +-------+-----+------+-------+

          +-----+------+-------+
          | id  | size | price |
  +-------+-----+------+-------+
  | time  |     |      |       |
  +-------+-----+------+-------+
  | t1    | ID2 | 9    | 97    |
  | t3    | ID2 | 15   | 94    |
  | t5    | ID2 | 13   | 100   |
  +-------+-----+------+-------+
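For anyone who wants to reproduce the two frames, here is a minimal sketch. The CSV text is made up to match the tables above, and passing index_col="time" is my assumption about how the files were read; real files would be given to read_csv by path rather than through StringIO.

```python
import io

import pandas as pd

# Hypothetical CSV contents that mirror the two tables above.
csv1 = "time,id,size,price\nt0,ID1,10,110\nt2,ID1,12,109\nt6,ID1,20,108\n"
csv2 = "time,id,size,price\nt1,ID2,9,97\nt3,ID2,15,94\nt5,ID2,13,100\n"

# index_col="time" makes the first column the index, as described above.
df1 = pd.read_csv(io.StringIO(csv1), index_col="time")
df2 = pd.read_csv(io.StringIO(csv2), index_col="time")

print(df1)
print(df2)
```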

I would like to create a single large data frame with entries for both ids, and use ffill to forward-fill values from the previous time step.

I can achieve this using a combination of concat, sort and ffill.

However, this requires first renaming the columns of one of the data frames so that there are no name conflicts:

    df2.columns = ['id', 'id2_size', 'id2_price']
    df = pd.concat([df1, df2]).sort_index().ffill()

This results in the following data frame:

          +-----+------+-------+----------+-----------+
          | id  | size | price | id2_size | id2_price |
  +-------+-----+------+-------+----------+-----------+
  | time  |     |      |       |          |           |
  +-------+-----+------+-------+----------+-----------+
  | t0    | ID1 | 10   | 110   | nan      | nan       |
  | t1    | ID2 | 10   | 110   | 9        | 97        |
  | t2    | ID1 | 12   | 109   | 9        | 97        |
  | t3    | ID2 | 12   | 109   | 15       | 94        |
  | t5    | ID2 | 12   | 109   | 13       | 100       |
  | t6    | ID1 | 20   | 108   | 13       | 100       |
  +-------+-----+------+-------+----------+-----------+
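As a self-contained sketch of this rename-then-concat approach: the frames below are built by hand to match the tables in the question, and I use rename and sort_index, which is how I would write it against a current pandas. The name df2_renamed is just for illustration.

```python
import pandas as pd

# Frames rebuilt by hand from the tables in the question, with time as the index.
df1 = pd.DataFrame(
    {"id": ["ID1"] * 3, "size": [10, 12, 20], "price": [110, 109, 108]},
    index=pd.Index(["t0", "t2", "t6"], name="time"),
)
df2 = pd.DataFrame(
    {"id": ["ID2"] * 3, "size": [9, 15, 13], "price": [97, 94, 100]},
    index=pd.Index(["t1", "t3", "t5"], name="time"),
)

# Rename the value columns of the second frame so they do not collide.
df2_renamed = df2.rename(columns={"size": "id2_size", "price": "id2_price"})

# Stack the rows, order them by time, then carry the last seen values forward.
combined = pd.concat([df1, df2_renamed]).sort_index().ffill()
print(combined)
```

The rows for t0 keep NaN in the id2_* columns because there is nothing earlier to fill from.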

My current method is pretty clunky in that I need to rename the columns of one of the data frames.

I believe that the best way to represent the data would be to use a MultiIndex on the columns, with the top-level values coming from the id column.

The resulting data frame would look like this:

          +--------------+--------------+
          |     ID1      |     ID2      |
          +------+-------+------+-------+
          | size | price | size | price |
  +-------+------+-------+------+-------+
  | time  |      |       |      |       |
  +-------+------+-------+------+-------+
  | t0    | 10   | 110   | nan  | nan   |
  | t1    | 10   | 110   | 9    | 97    |
  | t2    | 12   | 109   | 9    | 97    |
  | t3    | 12   | 109   | 15   | 94    |
  | t5    | 12   | 109   | 13   | 100   |
  | t6    | 20   | 108   | 13   | 100   |
  +-------+------+-------+------+-------+

Is this possible?
If so, what steps are required to go from the two data frames read from csv to the final combined, multi-indexed data frame?

1 answer

Here's a one-liner that does what you ask for, although it's a bit confusing in terms of stacking / unstacking:

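    # time is treated as a regular column here; if it is already the index,
    # call reset_index() on each frame first
    pd.concat([df1, df2]).set_index(['time','id']).sort_index().stack().unstack(level=[1,2]).ffill()

    id     ID1          ID2
          size  price  size  price
    time
    t0      10    110   NaN    NaN
    t1      10    110     9     97
    t2      12    109     9     97
    t3      12    109    15     94
    t5      12    109    13    100
    t6      20    108    13    100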
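Here is a runnable sketch of that one-liner, split over several lines and written with concat and sort_index, which is how I would spell it against a recent pandas. The sample frames are rebuilt by hand from the question with time as an ordinary column, and the name wide is only illustrative.

```python
import pandas as pd

# Sample frames from the question, with time as an ordinary column.
df1 = pd.DataFrame(
    {"time": ["t0", "t2", "t6"], "id": "ID1", "size": [10, 12, 20], "price": [110, 109, 108]}
)
df2 = pd.DataFrame(
    {"time": ["t1", "t3", "t5"], "id": "ID2", "size": [9, 15, 13], "price": [97, 94, 100]}
)

wide = (
    pd.concat([df1, df2])
    .set_index(["time", "id"])   # rows keyed by (time, id)
    .sort_index()
    .stack()                     # long Series keyed by (time, id, field)
    .unstack(level=[1, 2])       # move id and field into the columns
    .ffill()                     # carry each id's last values forward
)
print(wide)
```

Selecting wide["ID1"] then gives the size and price columns for that id only.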

FWIW, my default approach would be something like the following, which is a bit simpler (less stacking / unstacking) and gives you the same basic results, but with a different column organization:

    pd.concat([df1, df2]).set_index(['time','id']).sort_index().unstack().ffill()

          size       price
    id     ID1  ID2    ID1  ID2
    time
    t0      10  NaN    110  NaN
    t1      10    9    110   97
    t2      12    9    109   97
    t3      12   15    109   94
    t5      12   13    109  100
    t6      20   13    108  100
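To show what the different column organization buys you, here is a small sketch under the same assumptions as before (hand-built sample frames, time as an ordinary column, illustrative variable names). With the field on the top level, picking one field returns one column per id.

```python
import pandas as pd

# Sample frames from the question, with time as an ordinary column.
df1 = pd.DataFrame(
    {"time": ["t0", "t2", "t6"], "id": "ID1", "size": [10, 12, 20], "price": [110, 109, 108]}
)
df2 = pd.DataFrame(
    {"time": ["t1", "t3", "t5"], "id": "ID2", "size": [9, 15, 13], "price": [97, 94, 100]}
)

# unstack() moves the innermost index level (id) under each value column.
wide = pd.concat([df1, df2]).set_index(["time", "id"]).sort_index().unstack().ffill()

# Top-level selection by field: a frame with one column per id.
sizes = wide["size"]
print(sizes)
```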

Along these lines, you can add swaplevel and a column sort to get the columns grouped by id, as in the first approach:

    pd.concat([df1, df2]).set_index(['time','id']).sort_index().unstack().ffill().swaplevel(0,1,axis=1).sort_index(axis=1)
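A runnable sketch of the regrouping step, again with hand-built sample frames and illustrative names (wide, by_id). One thing to be aware of: sorting the columns orders the inner level alphabetically, so price comes before size within each id.

```python
import pandas as pd

# Sample frames from the question, with time as an ordinary column.
df1 = pd.DataFrame(
    {"time": ["t0", "t2", "t6"], "id": "ID1", "size": [10, 12, 20], "price": [110, 109, 108]}
)
df2 = pd.DataFrame(
    {"time": ["t1", "t3", "t5"], "id": "ID2", "size": [9, 15, 13], "price": [97, 94, 100]}
)

wide = pd.concat([df1, df2]).set_index(["time", "id"]).sort_index().unstack().ffill()

# Put id on the top column level, then sort so each id's columns sit together.
by_id = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(by_id)
```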