Slice / split string series at different positions

Question

I want to split the Series string at different points depending on the length of the specific substrings:

    In [47]: df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'], columns=['group_class'])

    In [48]: split_locations = df.group_class.str.rfind('class')

    In [49]: split_locations
    Out[49]:
    0    6
    1    7
    2    7
    dtype: int64

    In [50]: df
    Out[50]:
          group_class
    0    group9class1
    1   group10class2
    2  group11class20

My output should look like this:

          group_class    group    class
    0    group9class1   group9   class1
    1   group10class2  group10   class2
    2  group11class20  group11  class20

I thought this might work:

    In [56]: df.group_class.str[:split_locations]
    Out[56]:
    0   NaN
    1   NaN
    2   NaN

How can I slice each row at its own position from split_locations?

+4

python pandas

London rob Aug 7 '15 at 15:14


3 answers

Use a regex to split the string:

    import re

    regex = re.compile("(class)")
    s = "group1class23"
    # Insert a space before "class", then split on that space.
    split_string = re.sub(regex, " \\1", s).split(" ")

This returns a list:

    ['group1', 'class23']

So, to add two new columns to your DataFrame , you can do:

    new_cols = [re.sub(regex, " \\1", x).split(" ") for x in df.group_class]
    df['group'], df['class'] = zip(*new_cols)

Result:

          group_class    group    class
    0    group9class1   group9   class1
    1   group10class2  group10   class2
    2  group11class20  group11  class20

+2

Rob Aug 7 '15 at 15:28


You can also use zip along with a list comprehension:

    df['group'], df['class'] = zip(*[(string[:n], string[n:])
                                     for string, n in zip(df.group_class, split_locations)])

    >>> df
          group_class    group    class
    0    group9class1   group9   class1
    1   group10class2  group10   class2
    2  group11class20  group11  class20

+1

Alexander Aug 7 '15 at 15:30


Edchum · Accepted Answer · 2015-08-07T15:23:50+0000

This works if you select with double brackets [[ ]], which gives a DataFrame rather than a Series. With axis=1, apply then passes each row in turn, and the row's name attribute is its index label, which you can use to look up the matching entry in split_locations:

    In [119]: df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:][0], x.str[:split_locations[x.name]][0]]), axis=1)
    Out[119]:
             0        1
    0   class1   group9
    1   class2  group10
    2  class20  group11
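The row-wise lookup can also be written with a named function, which may be easier to read. This is only a sketch of the same idea: the helper name split_row is hypothetical, and the data is the example from the question.

```python
import pandas as pd

df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'],
                  columns=['group_class'])
split_locations = df['group_class'].str.rfind('class')

def split_row(row):
    # With axis=1, row is a Series and row.name is its index label,
    # so it selects this row's split position.
    n = split_locations.loc[row.name]
    text = row['group_class']
    return pd.Series({'group': text[:n], 'class': text[n:]})

# Double brackets keep a DataFrame, so apply works row by row.
result = df[['group_class']].apply(split_row, axis=1)
print(result)
```

Returning a Series with labelled entries gives the new columns their names directly, instead of 0 and 1.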

Or, as @ajcr suggested, you can use str.extract:

    In [106]: df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
    Out[106]:
         group    class
    0   group9   class1
    1  group10   class2
    2  group11  class20

EDIT

Regex explanation:

The regular expression came from @ajcr (thanks!). It is passed to str.extract, which pulls out the capture groups; each group becomes a new column.

Here ?P<group> gives the capture group a name, and that name becomes the column name. If the name is omitted, the column is labelled with an integer instead.
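To see the effect of naming the groups, here is a small sketch comparing the two forms on the example data. The variable names unnamed and named are only for illustration.

```python
import pandas as pd

s = pd.Series(['group9class1', 'group10class2', 'group11class20'])

# Plain capture groups: columns are labelled by position.
unnamed = s.str.extract(r'(group[0-9]+)(class[0-9]+)')

# Named capture groups: the group names become the column labels.
named = s.str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')

print(list(unnamed.columns))
print(list(named.columns))
```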

The rest should be self-evident: group[0-9]+ matches the literal string group followed by one or more digits. The square brackets define a character class, here the range 0-9. For this data it is equivalent to group\d+, where \d means a digit.

Therefore, it can be rewritten as:

 df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
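To get the layout asked for in the question, with the original column next to the two new ones, the extracted frame can be joined back onto df. A minimal sketch, assuming the example data from the question; the name parts is just illustrative.

```python
import pandas as pd

df = pd.DataFrame(['group9class1', 'group10class2', 'group11class20'],
                  columns=['group_class'])

# Extract both pieces in one pass; named groups become column names.
parts = df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')

# join aligns on the index, so each row keeps its own pieces.
result = df.join(parts)
print(result)
```

Rows that do not match the pattern would come back as NaN in both new columns, so it is worth checking for missing values on messier data.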
