Remove punctuation for each row in pandas data frame

Question

I am new to Python, so this may be a very simple question. I am trying to use a lambda to remove punctuation from each row of a column in a pandas DataFrame. I used the following but got an error. I would like to avoid converting the column to a list, appending the cleaned results to a new list, and then converting that back to a DataFrame.

Any suggestions would be appreciated!

    import string
    df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)

+6

python lambda pandas

Rjl Oct 9 '15 at 10:05

2 answers

Using regex is likely to be faster:

    In [10]: import re, string; import pandas as pd

    In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])  # perhaps this is available in the re/regex library?

    In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])

    In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)  # regex=True is required on pandas >= 2.0
    Out[13]:
    0    ab
    1    cd
    2    ef
    dtype: object
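As a self-contained sketch of the same idea, the pattern can also be written as a single character class rather than a long alternation. The name `PUNCT_PATTERN` is made up for this example, and the explicit `regex=True` assumes a pandas version in which `Series.str.replace` accepts that argument.

```python
import re
import string

import pandas as pd

# Character class that matches any single ASCII punctuation character.
# re.escape protects characters such as ']' , '\' and '-' inside the class.
PUNCT_PATTERN = "[" + re.escape(string.punctuation) + "]"

s = pd.Series(["a..b", "c<=d", "e|}f"])

# Pass regex=True explicitly so the pattern is not treated as a literal string.
cleaned = s.str.replace(PUNCT_PATTERN, "", regex=True)
print(cleaned.tolist())  # ['ab', 'cd', 'ef']
```

Whether this is actually faster than the `apply` approach depends on the data, so it is worth timing both on a sample of the real column.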

+4

Andy hayden Oct 9 '15 at 10:42

bernie · Accepted Answer · 2015-10-09T22:13:31+0000

You need to iterate over the string in the data frame, not over string.punctuation. You also need to join the remaining characters back into a string using .join().

 df['cleaned'] = df['old'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
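For anyone who wants to try this out, here is a minimal runnable sketch of the one-liner above. The sample values in `df` are invented for illustration; only the column names `old` and `cleaned` come from the question.

```python
import string

import pandas as pd

# Hypothetical sample data; the question does not show the real frame.
df = pd.DataFrame({"old": ["Hello, world!", "a..b", "no punctuation"]})

# For each cell, keep only the characters that are not ASCII punctuation,
# then join them back into a single string.
df["cleaned"] = df["old"].apply(
    lambda x: "".join([ch for ch in x if ch not in string.punctuation])
)

print(df["cleaned"].tolist())  # ['Hello world', 'ab', 'no punctuation']
```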

When a lambda expression becomes long, it may be more readable to write out the function definition separately, for example (thanks @AndyHayden for optimization tips):

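    import string

    # Build the set once, not on every character, so membership tests stay cheap
    PUNCTUATION = frozenset(string.punctuation)

    def remove_punctuation(s):
        return ''.join([i for i in s if i not in PUNCTUATION])

    df['cleaned'] = df['old'].apply(remove_punctuation)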
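One thing to watch for: if the column contains missing values (NaN) or other non-string cells, iterating over them raises a TypeError. A possible variant, sketched here with the hypothetical name `remove_punctuation_safe`, passes such values through unchanged. Whether that is the right behaviour depends on the data.

```python
import string

# Built once so each membership test is a cheap set lookup.
PUNCTUATION = frozenset(string.punctuation)


def remove_punctuation_safe(value):
    """Strip ASCII punctuation from strings; return anything else unchanged."""
    if not isinstance(value, str):
        # e.g. NaN (a float) or None: nothing to clean, so pass it through.
        return value
    return "".join(ch for ch in value if ch not in PUNCTUATION)


print(remove_punctuation_safe("Hello, world!"))  # Hello world
print(remove_punctuation_safe(None))             # None
```

It can be applied the same way as above, i.e. `df['cleaned'] = df['old'].apply(remove_punctuation_safe)`.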
