How to efficiently combine similar data in pandas into one giant data frame

I have 7000 data frames with columns

 Date, X_1
 Date, X_2
 ...

Each data file has about 2500 rows.

The dates sometimes overlap, but overlap is not guaranteed.

I would like to combine them into a single data frame of the form

 Date X_1 X_2 etc. 

I tried applying combine_first 7000 times, but it was very slow, since it had to create 7000 new objects, each of which is slightly larger than the last.

Is there a more efficient way to combine multiple data frames?
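For concreteness, the slow approach looks roughly like this (a sketch with three toy frames standing in for the 7000; the data values and dates are made up for illustration):

```python
import functools
import pandas as pd

# three small stand-in frames; the real case has 7000 of them
frames = [
    pd.DataFrame({'X_1': [1.0, 2.0]}, index=pd.to_datetime(['2020-01-01', '2020-01-02'])),
    pd.DataFrame({'X_2': [3.0, 4.0]}, index=pd.to_datetime(['2020-01-02', '2020-01-03'])),
    pd.DataFrame({'X_3': [5.0, 6.0]}, index=pd.to_datetime(['2020-01-01', '2020-01-03'])),
]

# repeated combine_first: each step allocates a new, slightly larger frame
result = functools.reduce(lambda acc, df: acc.combine_first(df), frames)
```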

2 answers

Assuming Date is the index rather than a column, you can do an "outer" join:

 df1.join([df2, df3, ..., df7000], how='outer') 

Note: it may be more efficient to pass a generator of DataFrames rather than a list.

For instance:

 df1 = pd.DataFrame([[1, 2]], columns=['a', 'b'])
 df2 = pd.DataFrame([[3, 4]], index=[1], columns=['c', 'd'])
 df3 = pd.DataFrame([[5, 6], [7, 8]], columns=['e', 'f'])

 In [4]: df1.join([df2, df3], how='outer')
 Out[4]:
      a    b    c    d  e  f
 0    1    2  NaN  NaN  5  6
 1  NaN  NaN    3    4  7  8


If 'Date' is a column, you can use set_index first:

 df1.set_index('Date', inplace=True) 
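Putting the pieces together, a minimal end-to-end sketch (again with three toy frames standing in for the 7000; the data values are made up for illustration):

```python
import pandas as pd

# each frame has a Date column plus its own value column
df1 = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'], 'X_1': [1.0, 2.0]})
df2 = pd.DataFrame({'Date': ['2020-01-02', '2020-01-03'], 'X_2': [3.0, 4.0]})
df3 = pd.DataFrame({'Date': ['2020-01-01', '2020-01-03'], 'X_3': [5.0, 6.0]})

# move Date into the index so join aligns on it
frames = [df.set_index('Date') for df in (df1, df2, df3)]

# one multi-way outer join instead of thousands of pairwise combines
combined = frames[0].join(frames[1:], how='outer')
```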

How about this:

 import os
 import pandas as pd

 # os.listdir gives file names, not frames, so read each file first (assuming CSV)
 list_of_dfs = [pd.read_csv(os.path.join(dir_with_data, fname))
                for fname in os.listdir(dir_with_data)]
 df = pd.concat(list_of_dfs)
 df = df.set_index('Date')  # set_index returns a new frame, so assign it back
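For what it's worth, pd.concat along axis=1 performs the same multi-way outer alignment in a single call (a sketch under the assumption that each frame is already indexed by date; the values are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'X_1': [1.0, 2.0]}, index=['2020-01-01', '2020-01-02'])
df2 = pd.DataFrame({'X_2': [3.0, 4.0]}, index=['2020-01-02', '2020-01-03'])

# axis=1 with join='outer' aligns all frames on the index (the dates) in one pass
combined = pd.concat([df1, df2], axis=1, join='outer')
```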
