Counting occurrences of certain words in a pandas DataFrame

Question


I want to count the number of occurrences of certain words in a DataFrame. I know I can use str.contains:

    a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
    n = a.apply(lambda x: 1).sum()

I am currently using the code above. Is there a way to match a regular expression and get the number of occurrences? In my case, I have a large DataFrame, and I want to match about 100 rows.

+12

python pandas dataframe

Nilani algiriyage Jul 10 '13 at 14:48


2 answers

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regular expression can match several distinct words and you want each one counted separately, use s.str.match(...).str.get(0).groupby(lambda x: x).count().

It works as follows:

    In [12]: s
    Out[12]:
    0    ax
    1    ay
    2    bx
    3    by
    4    bz
    dtype: object

The string method match handles regular expressions ...

    In [13]: s.str.match('(b[xy]+)')
    Out[13]:
    0       []
    1       []
    2    (bx,)
    3    (by,)
    4       []
    dtype: object

... but the results, as returned, are not very convenient. The string method get extracts the matches as strings and converts empty results to NaN ...

    In [14]: s.str.match('(b[xy]+)').str.get(0)
    Out[14]:
    0    NaN
    1    NaN
    2     bx
    3     by
    4    NaN
    dtype: object

... which are not counted.

    In [15]: s.str.match('(b[xy]+)').str.get(0).count()
    Out[15]: 2
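In recent pandas versions, str.match returns a boolean Series rather than the captured groups, so the chain above may not behave as shown there. A minimal sketch of the same idea, assuming a pandas version where str.extract accepts expand=False, could look like this. Note that str.extract searches anywhere in the string while str.match anchors at the start; for this sample data the two give the same matches.

```python
import pandas as pd

s = pd.Series(['ax', 'ay', 'bx', 'by', 'bz'])

# With a single capture group and expand=False, str.extract returns a
# Series holding the captured text, or NaN where the pattern does not match.
matches = s.str.extract(r'(b[xy]+)', expand=False)

# count() skips NaN, so this is the number of strings that matched.
total = matches.count()

# value_counts() also skips NaN and tallies each distinct match separately.
per_word = matches.value_counts()

print(total)
print(per_word)
```

Here value_counts() plays the role of the groupby(lambda x: x).count() step mentioned at the top of this answer.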

+4

Dan allan Jul 10 '13 at 15:08


Andy hayden · Accepted Answer · 2013-07-10T15:08:46+0000

Update: the original answer, below, counts the rows that contain a substring.

To count all occurrences of a substring, you can use .str.count:

    In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

    In [22]: df.words.str.count("he|wo")
    Out[22]:
    0    1
    1    1
    2    2
    Name: words, dtype: int64

    In [23]: df.words.str.count("he|wo").sum()
    Out[23]: 4
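The question mentions matching a large number of words. One possible way to extend .str.count to that case is to count each word separately and collect the totals in a dict. This is only a sketch, not part of the original answer: the list words_to_count is a hypothetical stand-in for the real word list, and re.escape is used on the assumption that the words should be matched literally.

```python
import re
import pandas as pd

df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

# Hypothetical list of words to look for (illustrative only).
words_to_count = ['he', 'wo', 'zz']

# re.escape guards against words that contain regex metacharacters.
counts = {
    w: int(df['words'].str.count(re.escape(w)).sum())
    for w in words_to_count
}

print(counts)
```

Each word costs one pass over the column, which is usually acceptable for around a hundred words.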

The str.contains method accepts a regular expression:

    Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
    Docstring:
    Check whether given pattern is contained in each string in the array

    Parameters
    ----------
    pat : string
        Character sequence or regular expression
    case : boolean, default True
        If True, case sensitive
    flags : int, default 0 (no flags)
        re module flags, e.g. re.IGNORECASE
    na : default NaN, fill value for missing values.
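The case and na parameters matter when a column has mixed case or missing values. The sketch below uses made-up sample data (an assumption for illustration) to show both: without na=False, a missing value would come back as a missing value in the result instead of False.

```python
import numpy as np
import pandas as pd

# Made-up data with mixed case and one missing value.
s = pd.Series(['Hello', 'world', np.nan])

# case=False makes the match case-insensitive; na=False treats missing
# values as non-matches, so the result is a clean boolean Series.
mask = s.str.contains('he', case=False, na=False)

print(mask.tolist())
print(int(mask.sum()))
```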

For example:

    In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

    In [12]: df
    Out[12]:
       words
    0  hello
    1  world

    In [13]: df.words.str.contains(r'[hw]')
    Out[13]:
    0    True
    1    True
    Name: words, dtype: bool

    In [14]: df.words.str.contains(r'he|wo')
    Out[14]:
    0    True
    1    True
    Name: words, dtype: bool

To count the matching rows, you can simply sum this boolean Series:

    In [15]: df.words.str.contains(r'he|wo').sum()
    Out[15]: 2

    In [16]: df.words.str.contains(r'he').sum()
    Out[16]: 1
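With many words, writing the alternation by hand becomes unwieldy. A minimal sketch of building the pattern programmatically follows. The list words_to_match and the sample rows are hypothetical, and the words are assumed to be literal text, hence re.escape.

```python
import re
import pandas as pd

df = pd.DataFrame(['hello', 'world', 'pandas'], columns=['words'])

# Hypothetical word list; in practice this could hold a hundred entries.
words_to_match = ['he', 'wo']

# Join the escaped words into a single alternation such as 'he|wo'.
pattern = '|'.join(re.escape(w) for w in words_to_match)

# True for each row that contains at least one of the words.
mask = df['words'].str.contains(pattern)
n_rows = int(mask.sum())

print(pattern)
print(n_rows)
```

This counts rows with at least one match, not the total number of occurrences.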
