Split (explode) a pandas DataFrame string entry into separate rows

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume the CSVs are clean and only need to be split on ","). For example, a should become b:

In [7]: a
Out[7]:
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]:
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

So far I have tried various simple functions, but .apply seems to accept only one row as the return value when used along an axis, and I cannot get .transform to work. Any suggestions would be much appreciated!

Sample data:

from pandas import DataFrame
import numpy as np

a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])

b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

I know this will not work because we lose the DataFrame metadata by going through numpy, but it should give you an idea of what I was trying to do:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters

a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)
+163 · python, numpy, pandas, dataframe · 01 Oct · 19 answers

How about something like this:

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just need to rename the columns.
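For example, continuing from the output above (a quick sketch; a is the sample frame from the question):

b = pd.concat([pd.Series(row['var2'], row['var1'].split(','))
               for _, row in a.iterrows()]).reset_index()
b = b.rename(columns={'index': 'var1', 0: 'var2'})  # 'index' held the letters, 0 held var2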

+70 · 01 Oct

UPDATE 2: a more generic vectorized function, which works for multiple normal columns and multiple list columns

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col: np.concatenate(df.loc[lens > 0, col].values)
                        for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens == 0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:
        res = res.reset_index(drop=True)
    return res

Demo:

Multiple list columns - all list columns must have the same number of elements in each row:

In [134]: df
Out[134]:
   aaa  myid        num          text
0   10     1  [1, 2, 3]  [aa, bb, cc]
1   11     2         []            []
2   12     3     [1, 2]      [cc, dd]
3   13     4         []            []

In [135]: explode(df, ['num','text'], fill_value='')
Out[135]:
   aaa  myid num text
0   10     1   1   aa
1   10     1   2   bb
2   10     1   3   cc
3   11     2
4   12     3   1   cc
5   12     3   2   dd
6   13     4

Preserving the original index values:

In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
Out[136]:
   aaa  myid num text
0   10     1   1   aa
0   10     1   2   bb
0   10     1   3   cc
1   11     2
2   12     3   1   cc
2   12     3   2   dd
3   13     4

Setup:

df = pd.DataFrame({
    'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
    'myid': {0: 1, 1: 2, 2: 3, 3: 4},
    'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
    'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
})

CSV Column:

In [46]: df
Out[46]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
Out[47]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

Using this little trick, we can convert the CSV-like column to a list column:

In [48]: df.assign(var1=df.var1.str.split(','))
Out[48]:
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ



UPDATE: generic vectorized approach (also works for multiple columns):

Original DF:

In [177]: df
Out[177]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

Solution:

First, let's convert the CSV strings to lists:

In [178]: lst_col = 'var1'

In [179]: x = df.assign(**{lst_col: df[lst_col].str.split(',')})

In [180]: x
Out[180]:
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

Now we can do this:

In [181]: pd.DataFrame({
     ...:     col: np.repeat(x[col].values, x[lst_col].str.len())
     ...:     for col in x.columns.difference([lst_col])
     ...: }).assign(**{lst_col: np.concatenate(x[lst_col].values)})[x.columns.tolist()]
Out[181]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ



OLD answer:

Inspired by @AFinkelstein's solution, I wanted to make it a bit more generalized so that it could be applied to a DF with more than two columns, while staying almost as fast as AFinkelstein's solution:

In [2]: df = pd.DataFrame(
   ...:     [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
   ...:      {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
   ...: )

In [3]: df
Out[3]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [4]: (df.set_index(df.columns.drop('var1').tolist())
   ...:    .var1.str.split(',', expand=True)
   ...:    .stack()
   ...:    .reset_index()
   ...:    .rename(columns={0: 'var1'})
   ...:    .loc[:, df.columns]
   ...: )
Out[4]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ
+125 · Nov 06 '16 at 13:12

After hard experimentation to find something faster than the accepted answer, I got this to work. It ran about 100 times faster on the dataset I tried it on.

If anyone knows a way to make this more elegant, by all means, please edit my code. I could not find a way that works without setting the other columns you want to keep as the index, then resetting the index and renaming the columns, but I would imagine something else works.

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']]  # var1 variable is currently labeled 0
b.columns = ['var1', 'var2']      # renaming var1
+94 · Jan 28 '15 at 12:28

Here is the function I wrote for this common task. It is more efficient than the Series/stack methods. Column order and names are preserved.

def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

Using this function, answering the original question is as simple as:

 tidy_split(a, 'var1', sep=',') 
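For instance, keep=True also retains each presplit value as its own row (a small sketch using the question's frame a; the expected output is reconstructed from the function's logic):

tidy_split(a, 'var1', sep=',', keep=True)
#     var1  var2
# 0  a,b,c     1
# 0      a     1
# 0      b     1
# 0      c     1
# 1  d,e,f     2
# 1      d     2
# 1      e     2
# 1      f     2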
+39 · Oct 09 '16 at 17:57

Similar question: pandas: How to split text in a column into multiple rows?

You can do:

>> a = pd.DataFrame({"var1": "a,b,c d,e,f".split(), "var2": [1, 2]})
>> s = a.var1.str.split(",").apply(pd.Series, 1).stack()
>> s.index = s.index.droplevel(-1)  # to line up with a's index
>> s.name = 'var1'                  # needs a name to join
>> del a['var1']
>> a.join(s)
   var2 var1
0     1    a
0     1    b
0     1    c
1     2    d
1     2    e
1     2    f
+14 · Jun 24 '15 at 21:01

TL;DR

import pandas as pd
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})



Demonstration

explode_str(a, 'var1', ',')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Let's create a new dataframe d that has lists:

d = a.assign(var1=lambda d: d.var1.str.split(','))

explode_list(d, 'var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2



General comments

I use np.arange with repeat to produce dataframe index positions that I can use with iloc.
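A minimal sketch of that trick: each position is repeated once per element it should yield, and the resulting array indexes rows positionally:

import numpy as np

np.arange(2).repeat([3, 3])  # array([0, 0, 0, 1, 1, 1])
# a.iloc[[0, 0, 0, 1, 1, 1]] then duplicates row 0 and row 1 three times each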

Questions & Answers

Why am I not using loc?

Because the index may not be unique, and using loc will return every row that matches the queried index.
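A quick illustration of the difference (a sketch):

s = pd.Series(['x', 'y', 'z'], index=[0, 0, 1])
s.loc[0]     # label-based: returns BOTH rows labelled 0
s.iloc[[0]]  # position-based: returns exactly the first row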

Why don't you use the values attribute and slice that?

When calling values, if the entirety of the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype object. By using iloc instead of slicing the values attribute, I free myself from having to deal with that.
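A short sketch of that dtype behaviour:

df = pd.DataFrame({'var1': ['a', 'b'], 'var2': [1, 2]})
df.values.dtype            # dtype('O'): mixed columns are upcast to object
df[['var2']].values.dtype  # dtype('int64'): a single cohesive block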

Why are you using assign?

When I use assign with the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.
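A sketch of that overwrite behaviour:

a.assign(var1=a['var1'].str.upper()).columns.tolist()
# ['var1', 'var2']: var1 is replaced in place, column order unchanged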

Why are the index values repeated?

By using iloc on repeated positions, the resulting index shows the same repeated pattern: one repeat for each element in the list or string.
This can be reset with reset_index(drop=True).
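For example (a sketch):

explode_str(a, 'var1', ',').reset_index(drop=True)
#   var1  var2
# 0    a     1
# 1    b     1
# 2    c     1
# 3    d     2
# 4    e     2
# 5    f     2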




For strings

I don't want to split the strings prematurely. So instead, I count the occurrences of the sep argument, on the grounds that if I were to split, the length of the resulting list would be one more than the number of separators.

I then use that sep to join all the strings together and then split the result.

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
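To see the intermediate pieces on the question's frame a (a sketch):

s = a['var1']           # ['a,b,c', 'd,e,f']
s.str.count(',') + 1    # [3, 3]: elements each row yields after splitting
','.join(s)             # 'a,b,c,d,e,f'
','.join(s).split(',')  # ['a', 'b', 'c', 'd', 'e', 'f']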

For lists

Just like for strings, except that I do not need to count the occurrences of sep because it is already split.

I use NumPy's concatenate to smash the lists together.

import pandas as pd
import numpy as np

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})



+12 · Aug 08

Pandas >= 0.25

Series and DataFrame now define an .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

Since you have a list of comma-separated strings, split the string on the comma to get a list of elements, then call explode on that column.

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
df
    var1  var2
0  a,b,c     1
1  d,e,f     2

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Note that explode only works with one column (for now).




NaNs and empty lists get the treatment they deserve, without having to jump through hoops to get it right.

df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
df
    var1  var2
0  d,e,f     1
1            2
2    NaN     3

df['var1'].str.split(',')

0    [d, e, f]
1           []
2          NaN

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    d     1
0    e     1
0    f     1
1          2   # empty list entry becomes empty string after exploding
2  NaN     3   # NaN left un-touched

This is a major advantage over ravel/repeat-based solutions (which ignore empty lists completely and choke on NaNs).

+9 · Jul 20 '19 at 7:18

I came up with a solution for dataframes with an arbitrary number of columns (while still only splitting one column's entries at a time).

import pandas

def splitDataFrameList(df, target_column, separator):
    '''
    df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated,
        with each element moved into a new row. The values in the other
        columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_column, separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_column, separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df
+5 · Apr 21 '15 at 9:02

Here is a fairly straightforward approach that uses the split method from the pandas str accessor and then uses NumPy to flatten each row into a single array.

The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat.

var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) // len(df))  # assumes each row splits into the same number of items

pd.DataFrame({'var1': var1, 'var2': var2})

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2
+4 · Nov 04 '17 at 17:34

The string function split can take an optional boolean argument "expand".

Here is a solution using this argument:

(a.var1
  .str.split(",", expand=True)
  .set_index(a.var2)
  .stack()
  .reset_index(level=1, drop=True)
  .reset_index()
  .rename(columns={0: "var1"}))
+3 · Jun 05 '18 at 23:42

I kept running out of memory using various ways to explode my lists, so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying proportions of list length to number of lists. Sharing the results below:

Time (smaller is better): [benchmark chart: execution speed]

Peak memory usage (smaller is better): [benchmark chart: peak memory usage]

Conclusions:

  • @MaxU's answer (update 2), code name concatenate, offers the best speed in almost every case while keeping peak memory usage low;
  • see @DMulligan's answer (code name stack) if you need to process lots of rows with relatively small lists and can afford increased peak memory;
  • @Chang's accepted answer works well for data frames that have few rows but very large lists.

Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include the splitting of strings into lists, which most of the solutions performed in a similar fashion.

+3 · Jan 22 '19 at 23:45

Based on @DMulligan's excellent solution, here is a generic vectorized (no loops) function that splits a column of a dataframe into multiple rows and merges it back into the original frame. It also uses the generic change_column_order function from this answer.

def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column in the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)

Example:

df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]],
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])
df
       Name  A  B
10      a:b  1  4
12      c:d  2  5
13  e:f:g:h  3  6

split_df(df, 'Name', ':')
   Name  A  B
10    a  1  4
10    b  1  4
12    c  2  5
12    d  2  5
13    e  3  6
13    f  3  6
13    g  3  6
13    h  3  6

Note that it preserves the original index and column order. It also works with dataframes that have a non-sequential index.

+2 · Jan 05 '18 at 20:16

The string can be split and exploded without changing the structure of the dataframe:

Input:

   var1  var2
0  a,b,c     1
1  d,e,f     2

# get the indexes, repeated as many times as the split produces values
df = df.reindex(df.index.repeat(df['var1'].str.split(',').apply(len)))

# assign the split values to the dataframe column
df['var1'] = np.hstack(df['var1'].drop_duplicates().str.split(','))

Out:

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2
+2 · Oct 24 '18 at 16:29

@MaxU's answer, updated with MultiIndex support:

def explode(df, lst_cols, fill_value='', preserve_index=False):
    """
    usage:
        In [134]: df
        Out[134]:
           aaa  myid        num          text
        0   10     1  [1, 2, 3]  [aa, bb, cc]
        1   11     2         []            []
        2   12     3     [1, 2]      [cc, dd]
        3   13     4         []            []

        In [135]: explode(df, ['num','text'], fill_value='')
        Out[135]:
           aaa  myid num text
        0   10     1   1   aa
        1   10     1   2   bb
        2   10     1   3   cc
        3   11     2
        4   12     3   1   cc
        5   12     3   2   dd
        6   13     4
    """
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col: np.concatenate(df.loc[lens > 0, col].values)
                        for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens == 0, idx_cols], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:
        res = res.reset_index(drop=True)
    # if the original index is a MultiIndex, rebuild it from the tuples
    if isinstance(df.index, pd.MultiIndex):
        res = res.reindex(
            index=pd.MultiIndex.from_tuples(
                res.index,
                names=['number', 'color']
            )
        )
    return res
+1 · May 27 '19 at 8:55

I came up with the following solution to this problem:

def iter_var1(d):
    for _, row in d.iterrows():
        for v in row["var1"].split(","):
            yield (v, row["var2"])

new_a = DataFrame.from_records([i for i in iter_var1(a)],
                               columns=["var1", "var2"])
0 · Mar 17 '15 at 21:07

I just used jiln's excellent answer from above, but needed to extend it to split multiple columns. Thought I would share.

def splitDataFrameList(df, target_columns, separator):
    '''
    df = dataframe to split,
    target_columns = the columns containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target columns separated,
        with each element moved into a new row. The values in the other
        columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_columns, separator):
        split_rows = []
        for target_column in target_columns:
            split_rows.append(row[target_column].split(separator))
        # separate for multiple columns
        for i in range(len(split_rows[0])):
            new_row = row.to_dict()
            for j in range(len(split_rows)):
                new_row[target_columns[j]] = split_rows[j][i]
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_columns, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df
0 · Jun 19 '16 at 15:42

Another solution, using Python's copy module:

import copy

def pandas_explode(df, column_to_explode):
    new_observations = list()
    for row in df.to_dict(orient='records'):
        explode_values = row[column_to_explode]
        del row[column_to_explode]
        if type(explode_values) is list or type(explode_values) is tuple:
            for explode_value in explode_values:
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_value
                new_observations.append(new_observation)
        else:
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_values
            new_observations.append(new_observation)
    return_df = pd.DataFrame(new_observations)
    return return_df

df = pandas_explode(df, column_name)
0 · Jun 18 '17 at 10:27

The following approach combines the new df with the original.

a.reset_index().merge(
    a['var1'].str.split(',').apply(pd.Series).reset_index()
        .melt('index')[['index', 'value']].dropna()
)[['value', 'var2']].rename({'value': 'var1'}, axis=1)
0 · Apr 2 '19 at 19:45

There are many answers here, but I am surprised that no one has mentioned the built-in pandas explode function. Check out the link below: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode

For some reason, I was not able to access this function, so I used the following code:

import pandas_explode
pandas_explode.patch()

df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')
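If your pandas is 0.25 or newer, the same call should work without the patch library, using the built-in method covered in an earlier answer (a sketch reusing this answer's frame names):

df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')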

[screenshot: sample dataframe with a list-valued 'people' column]

Above is a sample of my data. As you can see, the people column held several people per row, and I was trying to explode it. The code given works for list-type data, so first get your comma-separated text data into list form. And since the code uses built-in functions, it is much faster than custom/apply solutions.
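A sketch of that conversion, assuming people is a comma-separated string column:

df_zlp_people_cnt2['people'] = df_zlp_people_cnt2['people'].str.split(',')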

Note: you may need to install pandas_explode with pip.

0 · Aug 02 '19 at 14:02


