Update: The original answer (below) counts the rows that contain a substring.
To count every occurrence of a substring, including repeats within a single row, use .str.count:
    In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

    In [22]: df.words.str.count("he|wo")
    Out[22]:
    0    1
    1    1
    2    2
    Name: words, dtype: int64

    In [23]: df.words.str.count("he|wo").sum()
    Out[23]: 4
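Because .str.count interprets its argument as a regular expression, a substring containing regex metacharacters (such as `.`) may match more than intended. A minimal sketch of one way to handle this, assuming you want a literal match: escape the substring with `re.escape` first. The `literal` and `s` names below are illustrative, not part of the original answer.

```python
import re

import pandas as pd

# Hypothetical data: the substring "a.b" contains the regex wildcard "."
s = pd.Series(["a.b", "axb a.b"])
literal = "a.b"

# Unescaped: "." matches any character, so "axb" is counted as well.
as_regex = s.str.count(literal)

# Escaped: only the literal text "a.b" is counted.
as_literal = s.str.count(re.escape(literal))

print(as_regex.tolist())    # [1, 2]
print(as_literal.tolist())  # [1, 1]
```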
The str.contains method accepts a regular expression:
    Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
    Docstring:
    Check whether given pattern is contained in each string in the array

    Parameters
    ----------
    pat : string
        Character sequence or regular expression
    case : boolean, default True
        If True, case sensitive
    flags : int, default 0 (no flags)
        re module flags, eg re.IGNORECASE
    na : default NaN, fill value for missing values.
For example:
    In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

    In [12]: df
    Out[12]:
       words
    0  hello
    1  world

    In [13]: df.words.str.contains(r'[hw]')
    Out[13]:
    0    True
    1    True
    Name: words, dtype: bool

    In [14]: df.words.str.contains(r'he|wo')
    Out[14]:
    0    True
    1    True
    Name: words, dtype: bool
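The `case`, `flags`, and `na` parameters from the docstring can be sketched as follows. The sample Series, with its mixed-case and missing values, is made up for illustration; passing `na=False` is one reasonable choice that treats missing values as non-matches.

```python
import re

import pandas as pd

# Hypothetical data with mixed case and a missing value
s = pd.Series(["Hello", "world", None])

# Case-insensitive match; missing values are reported as False
insensitive = s.str.contains("he", case=False, na=False)

# The same result expressed with an re module flag
with_flag = s.str.contains("he", flags=re.IGNORECASE, na=False)

print(insensitive.tolist())  # [True, False, False]
print(with_flag.tolist())    # [True, False, False]
```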
To count the rows that match, sum the boolean Series returned by str.contains (each True counts as 1):
    In [15]: df.words.str.contains(r'he|wo').sum()
    Out[15]: 2

    In [16]: df.words.str.contains(r'he').sum()
    Out[16]: 1
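The difference between the two approaches shows up as soon as a row contains the pattern more than once. A short sketch reusing the three-word frame from the update:

```python
import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# Rows containing at least one match: "hehe" counts once
rows_matching = int(df.words.str.contains(r"he|wo").sum())

# Total matches across all rows: "hehe" counts twice
total_matches = int(df.words.str.count(r"he|wo").sum())

print(rows_matching)  # 3
print(total_matches)  # 4
```

Use str.contains when the question is "how many rows match?" and str.count when it is "how many matches are there in total?".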
Andy Hayden