Splitting multiple columns into rows in pandas dataframe

I have a pandas dataframe as follows:

    ticker  account      value    date
    aa      assets       100,200  20121231, 20131231
    bb      liabilities  50, 150  20141231, 20131231

I would like to separate df['value'] and df['date'] so that the dataframe looks like this:

    ticker  account      value  date
    aa      assets       100    20121231
    aa      assets       200    20131231
    bb      liabilities  50     20141231
    bb      liabilities  150    20131231

I would really appreciate any help.

4 answers

You can first split each column, stack the result into a Series, and strip the surrounding whitespace:

    s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
    s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series into df1:

    df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Drop the old value and date columns and join the new ones:

    print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
      ticker      account value      date
    0     aa       assets   100  20121231
    1     aa       assets   200  20131231
    2     bb  liabilities    50  20141231
    3     bb  liabilities   150  20131231

I see this question often: how do I split a column that holds lists into several rows? I have seen this called exploding. Here are some links:

So, I wrote a function that will do this.

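    import numpy as np
    import pandas as pd

    def explode(df, columns):
        # Repeat each index label once per list element in the first column
        idx = np.repeat(df.index, df[columns[0]].str.len())
        # Object array of lists, one row per exploded column
        a = df.T.reindex(columns).values
        # Flatten every column's lists into one long 1-D array
        concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
        # One row per list element, one column per exploded column
        p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
        # join aligns the repeated index labels in p with the remaining columns
        return df.drop(columns, axis=1).join(p).reset_index(drop=True)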

But before we can use it, the columns need to hold lists (or other iterables).

Setup

    df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50', '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    df


Split value and date:

    df.value = df.value.str.split(',')
    df.date = df.date.str.split(',')
    df


Now we could explode on either column, or on both, one after the other.

Solution

 explode(df, ['value','date']) 

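For comparison, newer pandas releases ship a built-in `DataFrame.explode`. The sketch below assumes pandas 1.3 or later, where it accepts a list of columns, and assumes the lists in each row have matching lengths. It is not part of the function above, just a point of reference.

```python
import pandas as pd

df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

# Same preparation as above: turn the comma-separated strings into lists
df['value'] = df['value'].str.split(',')
df['date'] = df['date'].str.split(',')

# Assumption: pandas >= 1.3, where explode accepts several columns at once;
# the lists in each row must have the same length in every exploded column
result = df.explode(['value', 'date']).reset_index(drop=True)
print(result)
```

The values stay strings after exploding, so a later `astype` call would be needed if numeric columns are wanted.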


Timing

I removed strip from @jezrael's timing because I could not add it to mine efficiently. It is a necessary step for this question, since the OP has spaces after the commas in the strings. I set out to provide a general way to explode a column, given that it already holds iterables, and I think I did.
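One way the stripping could be folded into the list-building step, sketched here as an assumption and not as part of the timed code: split each string and strip every piece in a plain list comprehension. The helper name `split_and_strip` is hypothetical.

```python
import pandas as pd

def split_and_strip(series, sep=','):
    """Turn 'a, b' strings into ['a', 'b'] lists (hypothetical helper)."""
    return series.apply(lambda s: [part.strip() for part in s.split(sep)])

# The OP's data, including the spaces after some commas
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231, 20131231'],
                   ['bb', 'liabilities', '50, 150', '20141231, 20131231']],
                  columns=['ticker', 'account', 'value', 'date'])

# Each cell now holds a list of clean strings, ready for an explode step
df['value'] = split_and_strip(df['value'])
df['date'] = split_and_strip(df['date'])
```

This runs a Python-level loop over every row, so it will likely cost more than a bare `str.split` on large frames.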

code

    def get_df(n=1):
        return pd.DataFrame([['aa', 'assets', '100,200,200', '20121231,20131231,20131231'],
                             ['bb', 'liabilities', '50,50', '20141231,20131231']] * n,
                            columns=['ticker', 'account', 'value', 'date'])

small example, 2 rows

medium example, 200 rows

large example, 2,000,000 rows


I wrote an explode function based on previous answers. This can be useful for those who want to quickly grab and use it.

    def explode(df, cols, split_on=','):
        """
        Explode dataframe on the given columns, split on the given delimiter
        """
        cols_sep = list(set(df.columns) - set(cols))
        df_cols = df[cols_sep]
        # Number of pieces per row, taken from the first exploded column
        explode_len = df[cols[0]].str.split(split_on).map(len)
        repeat_list = []
        for r, e in zip(df_cols.values, explode_len):
            repeat_list.extend([list(r)]*e)
        df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
        df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                                for col in cols], axis=1)
        df_explode.columns = cols
        return pd.concat((df_repeat, df_explode), axis=1)

Example from @piRSquared:

    df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                       ['bb', 'liabilities', '50,50', '20141231,20131231']],
                      columns=['ticker', 'account', 'value', 'date'])
    explode(df, ['value', 'date'])

Output

    +-----------+------+-----+--------+
    |    account|ticker|value|    date|
    +-----------+------+-----+--------+
    |     assets|    aa|  100|20121231|
    |     assets|    aa|  200|20131231|
    |liabilities|    bb|   50|20141231|
    |liabilities|    bb|   50|20131231|
    +-----------+------+-----+--------+

Because I'm too new, I'm not allowed to write a comment, so I'm writing an “answer”.

@titipata your answer worked very well, but in my opinion there is a small “error” in your code that I cannot track down myself.

I am working with an example from this question and changing only the values.

    df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                       ['title2', 'publisher2', '2', '2.1,2.2']],
                      columns=['titel', 'publisher', 'print', 'electronic'])
    explode(df, ['print', 'electronic'])

        publisher   titel print electronic
    0  publisher1  title1   1.1          1
    1  publisher1  title1   1.2        2.1
    2  publisher2  title2     2        2.2

As you can see, in the "electronic" column there should be a value of "1" in the row "1", not "2.1".

Because of this, the whole dataset ends up altered. I hope someone can help me find a solution for this.
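A minimal sketch of one possible way around this, assuming the intended behavior is that a shorter list repeats its last entry until it matches the longest list in the same row (that is my reading of the expected output, not something stated above). The function name `explode_uneven` is hypothetical and is not part of @titipata's answer.

```python
import pandas as pd

def explode_uneven(df, cols, split_on=','):
    """Explode `cols` row by row, padding shorter lists with their last item."""
    keep = [c for c in df.columns if c not in cols]
    rows = []
    for _, row in df.iterrows():
        # Split and strip every exploded cell of this row
        parts = [[p.strip() for p in str(row[c]).split(split_on)] for c in cols]
        longest = max(len(p) for p in parts)
        # Assumption: pad by repeating the last item, so values never
        # slide into a neighbouring source row
        padded = [p + [p[-1]] * (longest - len(p)) for p in parts]
        for values in zip(*padded):
            rows.append(list(row[keep]) + list(values))
    return pd.DataFrame(rows, columns=keep + list(cols))

df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                   ['title2', 'publisher2', '2', '2.1,2.2']],
                  columns=['titel', 'publisher', 'print', 'electronic'])
result = explode_uneven(df, ['print', 'electronic'])
print(result)
```

Because each source row is handled on its own, the "electronic" value "1" stays with title1 in both of its rows. Padding with a missing value instead of repeating the last item would be an equally defensible choice, depending on what the data means.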
