Python pandas.Series.str.contains WHOLE WORD

df (a pandas DataFrame) has three rows.

col_name "This is Donald." "His hands are so small" "Why are his fingers so short?" 

I want to extract the rows containing "is" and "small".

If I do

 df.col_name.str.contains("is|small", case=False) 

Then it also matches "His", which I do not want.

Is the query below the right way to match whole words in a pandas Series?

 df.col_name.str.contains("\bis\b|\bsmall\b", case=False) 
3 answers

No, the regex /bis/b|/bsmall/b will not work because you are using /b, not \b, which means "word boundary".

Change this and you will get a match. I would recommend using

 \b(is|small)\b 

This regex is a little faster and a bit more readable, at least to me.
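
For example, here is a minimal sketch with str.contains, assuming the df from the question. Note the raw string: in a plain Python string "\b" is a backspace character, so it never reaches the regex engine as a word boundary. The non-capturing group (?:...) simply avoids pandas' warning about match groups.

# Assumes the df shown in the question.
# Raw string so \b is a word boundary; (?:...) avoids the match-group warning.
mask = df.col_name.str.contains(r"\b(?:is|small)\b", case=False, regex=True)
df[mask]
#                  col_name
# 0         This is Donald.
# 1  His hands are so small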


Your method (with \b) did not work for me. I'm not sure why you wouldn't just use the boolean AND operator (&), since I think that is what you actually want.

This is a dumb way to do this, but it works:

mask = lambda x: ("is" in x) & ("small" in x)  # plain substring checks, not whole words
series_name.apply(mask)
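
For comparison, a sketch of the same AND idea, but with whole-word regex masks instead of substring checks (again assuming the df from the question). With "is" and "small" as targets no single row contains both as whole words, so this particular combination comes back empty:

# Assumes the df shown in the question.
# AND two whole-word masks; unlike the substring version, "is" does not match "His".
both = (df.col_name.str.contains(r"\bis\b", case=False)
        & df.col_name.str.contains(r"\bsmall\b", case=False))
df.loc[both, 'col_name']  # empty: no row has both "is" and "small" as whole words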

First, you can convert everything to lowercase, remove punctuation and spaces, and then convert the result to a set of words.

import string

# regex=True keeps the pattern a regular expression (needed on newer pandas,
# where str.replace defaults to literal replacement).
df['words'] = [set(words) for words in df['col_name']
                                       .str.lower()
                                       .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                                       .str.strip()
                                       .str.split()]

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

Now you can use logical indexing to see if all of your target words are in these new word sets.

target_words = ['is', 'small']
# Convert the target words to lower case, just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(
    lambda words: all(target_word in words for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]
df['match'] = df.words.apply(
    lambda words: all(target_word in words for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

To extract the corresponding rows:

>>> df.loc[df.match, 'col_name']
1    His hands are so small
Name: col_name, dtype: object

To do this all in one expression using logical indexing:

df.loc[[all(target_word in word_set for target_word in target_words)
        for word_set in (set(words) for words in df['col_name']
                                                 .str.lower()
                                                 .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                                                 .str.strip()
                                                 .str.split())], :]
