Python pandas.Series.str.contains WHOLE WORD

df (a pandas DataFrame) has three rows.

col_name "This is Donald." "His hands are so small" "Why are his fingers so short?" 

I want to extract the rows containing "is" and "small".

If I do

 df.col_name.str.contains("is|small", case=False) 

Then it also matches "His", which I do not want.

Is the query below the right way to match whole words in a pandas Series?

 df.col_name.str.contains("\bis\b|\bsmall\b", case=False) 
3 answers

No, the regex /bis/b|/bsmall/b will not work because you are using /b, not \b, which means "word boundary".

Change this and you will get a match. I would recommend using

 \b(is|small)\b 

This regex is a little faster and a bit more readable, at least to me.
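
For example, here is a minimal sketch with str.contains, assuming the df from the question. Note the raw string: in a plain Python string "\b" is a backspace character, so it never reaches the regex engine as a word boundary. The non-capturing group (?:...) simply avoids pandas' warning about match groups.

# Assumes the df shown in the question.
# Raw string so \b is a word boundary; (?:...) avoids the match-group warning.
mask = df.col_name.str.contains(r"\b(?:is|small)\b", case=False, regex=True)
df[mask]
#                  col_name
# 0         This is Donald.
# 1  His hands are so small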


Your method (with \b) did not work for me. I'm not sure why you wouldn't just use the boolean AND operator (&), since I think that is what you actually want.

This is a dumb way to do this, but it works:

mask = lambda x: ("is" in x) & ("small" in x)  # plain substring checks, not whole words
series_name.apply(mask)
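
For comparison, a sketch of the same AND idea, but with whole-word regex masks instead of substring checks (again assuming the df from the question). With "is" and "small" as targets no single row contains both as whole words, so this particular combination comes back empty:

# Assumes the df shown in the question.
# AND two whole-word masks; unlike the substring version, "is" does not match "His".
both = (df.col_name.str.contains(r"\bis\b", case=False)
        & df.col_name.str.contains(r"\bsmall\b", case=False))
df.loc[both, 'col_name']  # empty: no row has both "is" and "small" as whole words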

First, you can convert everything to lowercase, remove punctuation and spaces, and then convert the result to a set of words.

import string

# regex=True keeps the pattern a regular expression (needed on newer pandas,
# where str.replace defaults to literal replacement).
df['words'] = [set(words) for words in df['col_name']
                                       .str.lower()
                                       .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                                       .str.strip()
                                       .str.split()]

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

Now you can use logical indexing to see if all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert the target words to lower case, just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(
    lambda words: all(target_word in words for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]
df['match'] = df.words.apply(
    lambda words: all(target_word in words for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

To extract the corresponding rows:

>>> df.loc[df.match, 'col_name']
1    His hands are so small
Name: col_name, dtype: object

To do this all in one expression using logical indexing:

df.loc[[all(target_word in word_set for target_word in target_words)
        for word_set in (set(words) for words in df['col_name']
                                                 .str.lower()
                                                 .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                                                 .str.strip()
                                                 .str.split())], :]
