Splitting multiple columns into rows in pandas dataframe

Question


I have a pandas dataframe as follows:

    ticker  account      value    date
    aa      assets       100,200  20121231, 20131231
    bb      liabilities  50, 150  20141231, 20131231

I would like to separate df['value'] and df['date'] so that the dataframe looks like this:

    ticker  account      value  date
    aa      assets       100    20121231
    aa      assets       200    20131231
    bb      liabilities  50     20141231
    bb      liabilities  150    20131231

I would really appreciate any help.

+5

split join pandas dataframe multiple-columns

ctan Jul 29 '16 at 5:21


4 answers

I see this question often: how do I split a column whose cells each hold a list into several rows? I have seen this called exploding.

So, I wrote a function that will do this.

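    import numpy as np
    import pandas as pd

    def explode(df, columns):
        # repeat each index label once per list element (lengths taken from the first column)
        idx = np.repeat(df.index, df[columns[0]].str.len())
        # one row per list column, in the requested order
        # (reindex replaces reindex_axis, which was removed in pandas 1.0)
        a = df.T.reindex(columns).values
        # flatten the lists of each column, then stack the columns back together
        concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
        p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
        # join aligns on the repeated index labels; concat(axis=1) can raise
        # on a duplicated index in recent pandas versions
        return df.drop(columns, axis=1).join(p).reset_index(drop=True)

But before we can use it, the columns must contain lists (or other iterables).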

Setup

    df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50', '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    df

Split value and date:

    df.value = df.value.str.split(',')
    df.date = df.date.str.split(',')
    df

Now we can explode on either column, or on both, one after the other.

Solution

    explode(df, ['value', 'date'])

Timing

I removed strip from @jezrael's timing because I could not add it efficiently to mine. It is a necessary step for this question, since the OP has spaces after the commas in the strings. My aim was to provide a general way to explode a column that already holds iterables, and I think I achieved that.
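For the spaces after the commas, one option is to strip while building the lists, before exploding. The following is only a sketch: `split_and_strip` is a hypothetical helper name, not part of pandas or of the answers here, and the sample data copies the spacing from the question.

```python
import pandas as pd

def split_and_strip(series, sep=","):
    # Hypothetical helper: "50, 150" -> ["50", "150"]
    return series.map(lambda s: [part.strip() for part in s.split(sep)])

df = pd.DataFrame(
    [["aa", "assets", "100,200", "20121231, 20131231"],
     ["bb", "liabilities", "50, 150", "20141231, 20131231"]],
    columns=["ticker", "account", "value", "date"],
)

# Replace each string cell with a list of cleaned pieces
for col in ["value", "date"]:
    df[col] = split_and_strip(df[col])
```

After this step the `value` and `date` cells hold lists without stray whitespace, which is the input shape a list-based explode function expects.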

Code

    def get_df(n=1):
        return pd.DataFrame([['aa', 'assets', '100,200,200', '20121231,20131231,20131231'],
                             ['bb', 'liabilities', '50,50', '20141231,20131231']] * n,
                            columns=['ticker', 'account', 'value', 'date'])

Small example: 2 rows

Medium example: 200 rows

Large example: 2,000,000 rows

+6

piRSquared Jul 29 '16 at 7:00


I wrote an explode function based on previous answers. This can be useful for those who want to quickly grab and use it.

    def explode(df, cols, split_on=','):
        """
        Explode the dataframe on the given columns, splitting on the given delimiter.
        """
        # columns that are repeated rather than split (set order is not guaranteed)
        cols_sep = list(set(df.columns) - set(cols))
        df_cols = df[cols_sep]
        # number of pieces per row, taken from the first column to split
        explode_len = df[cols[0]].str.split(split_on).map(len)
        repeat_list = []
        # .values replaces as_matrix(), which was removed in pandas 1.0
        for r, e in zip(df_cols.values, explode_len):
            repeat_list.extend([list(r)] * e)
        df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
        df_explode = pd.concat(
            [df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
             for col in cols],
            axis=1)
        df_explode.columns = cols
        return pd.concat((df_repeat, df_explode), axis=1)

Example from @piRSquared:

    df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50', '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    explode(df, ['value', 'date'])

Output

    +-----------+------+-----+--------+
    |    account|ticker|value|    date|
    +-----------+------+-----+--------+
    |     assets|    aa|  100|20121231|
    |     assets|    aa|  200|20131231|
    |liabilities|    bb|   50|20141231|
    |liabilities|    bb|   50|20131231|
    +-----------+------+-----+--------+

+1

titipata May 13, '17 at 2:19


Because I'm too new, I'm not allowed to write a comment, so I'm writing an “answer”.

@titipata your answer worked very well, but I think there is a small “error” in your code that I cannot track down myself.

I am using the example from this question and changed only the values.

    df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                       ['title2', 'publisher2', '2', '2.1,2.2']],
                      columns=['titel', 'publisher', 'print', 'electronic'])
    explode(df, ['print', 'electronic'])

        publisher   titel print electronic
    0  publisher1  title1   1.1          1
    1  publisher1  title1   1.2        2.1
    2  publisher2  title2     2        2.2

As you can see, in the "electronic" column the value in row 1 should be "1", not "2.1".

Because of this, the meaning of the data set changes. I hope someone can help me find a solution.
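A likely cause, judging from the function above: each column is split and stacked on its own and then lined up by position, and the other columns are repeated using the piece count of the first column only. When a row has a different number of pieces in each column, the values shift. The sketch below does not fix that; it only finds such rows so they can be handled before exploding. `mismatched_rows` is a hypothetical helper name.

```python
import pandas as pd

def mismatched_rows(df, cols, split_on=","):
    # Hypothetical helper: count the pieces per cell in each column to split
    counts = pd.DataFrame({c: df[c].str.split(split_on).str.len() for c in cols})
    # A row is a problem if its columns do not all have the same count
    return df[counts.nunique(axis=1) > 1]

df = pd.DataFrame(
    [["title1", "publisher1", "1.1,1.2", "1"],
     ["title2", "publisher2", "2", "2.1,2.2"]],
    columns=["titel", "publisher", "print", "electronic"],
)

# Both rows are returned here: 2 vs 1 pieces in the first, 1 vs 2 in the second
bad = mismatched_rows(df, ["print", "electronic"])
```

What to do with the rows it returns (pad the shorter list, repeat the single value, or explode the columns separately) depends on what the data means, so that choice is left open here.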

0

Caro Aug 19 '17 at 10:35


jezrael · Accepted Answer · 2016-07-29T05:25:29+0000

You can split the columns first, turn each result into a Series with stack, and remove the whitespace with strip:

    s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series into df1:

    df1 = pd.concat([s1, s2], axis=1, keys=['value', 'date'])

Drop the old value and date columns and join df1:

    print(df.drop(['value', 'date'], axis=1).join(df1).reset_index(drop=True))

      ticker      account value      date
    0     aa       assets   100  20121231
    1     aa       assets   200  20131231
    2     bb  liabilities    50  20141231
    3     bb  liabilities   150  20131231
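For readers on a newer pandas, the built-in `DataFrame.explode` may be enough. This is a different technique from the stack-and-join steps above, and it assumes pandas 1.3 or later, where `explode` accepts a list of columns; every row must have the same number of pieces in each exploded column. The sample data copies the question.

```python
import pandas as pd

df = pd.DataFrame(
    [["aa", "assets", "100,200", "20121231, 20131231"],
     ["bb", "liabilities", "50, 150", "20141231, 20131231"]],
    columns=["ticker", "account", "value", "date"],
)

cols = ["value", "date"]

# Turn each comma-separated string into a list
for col in cols:
    df[col] = df[col].str.split(",")

# Explode both columns together (assumes pandas >= 1.3)
result = df.explode(cols).reset_index(drop=True)

# Remove the spaces that followed the commas in the original strings
for col in cols:
    result[col] = result[col].str.strip()
```

The values remain strings after this; convert them with `astype` or `pd.to_datetime` if numeric or date types are needed.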
