import re
import pandas as pd

df = pd.DataFrame({'index': [1, 2, 3, 4],
                   'labels': ['created the tower',
                              'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})
columns = ['created', 'destroyed', 'tower', 'swimming pool']
pat = '|'.join(['({})'.format(re.escape(c)) for c in columns])
result = df['labels'].str.extractall(pat).groupby(level=0).count()
result.columns = columns
print(result)
gives
   created  destroyed  tower  swimming pool
0        1          0      1              0
1        0          1      1              0
2        1          0      0              1
3        0          1      0              1
Most of the work is done by str.extractall:
In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)')
Out[808]: 
               0          1      2              3
  match                                          
0 0      created        NaN    NaN            NaN
  1          NaN        NaN  tower            NaN
1 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN  tower            NaN
2 0      created        NaN    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
3 0          NaN  destroyed    NaN            NaN
  1          NaN        NaN    NaN  swimming pool
Since each match is placed on its own row, the desired result can be obtained with a groupby/count operation, grouping by the first level of the index (the source index).
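As a variation, named groups let extractall label the columns directly, which avoids the manual `result.columns = columns` assignment. This is a sketch; note that group names may not contain spaces, so multi-word keywords have to be sanitized first:

```python
import re
import pandas as pd

df = pd.DataFrame({'labels': ['created the tower',
                              'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})
keywords = ['created', 'destroyed', 'tower', 'swimming pool']

# Group names cannot contain spaces, so replace them with underscores.
names = [k.replace(' ', '_') for k in keywords]
# Named groups: extractall uses the group names as column labels.
pat = '|'.join('(?P<{}>{})'.format(n, re.escape(k))
               for n, k in zip(names, keywords))
result = df['labels'].str.extractall(pat).groupby(level=0).count()
print(result)
```

The trade-off is that the output columns carry the sanitized names (e.g. `swimming_pool`), so a final rename is still needed if the original keyword strings are required.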
Note that the Python re module has a hard-coded limit on the number of allowed groups:
/usr/lib/python3.4/sre_compile.py in compile(p, flags)
    577     if p.pattern.groups > 100:
    578         raise AssertionError(
--> 579         "sorry, but this version only supports 100 named groups"
    580         )
    581 

AssertionError: sorry, but this version only supports 100 named groups
This limits the extractall approach used above to a maximum of 100 keywords. (In Python 3.7 and later, this group limit was removed.)
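On older Pythons, one way around the limit is to split the keywords into chunks of at most 100, run extractall per chunk, and concatenate the per-chunk counts. A sketch, with a hypothetical helper name; `chunk_size=2` is used in the usage example only to exercise the chunking:

```python
import re
import pandas as pd

def count_keywords_chunked(labels, keywords, chunk_size=100):
    """Count keyword occurrences per row, chunking the pattern so that
    no single regex exceeds chunk_size capture groups."""
    pieces = []
    for i in range(0, len(keywords), chunk_size):
        chunk = keywords[i:i + chunk_size]
        pat = '|'.join('({})'.format(re.escape(k)) for k in chunk)
        counts = labels.str.extractall(pat).groupby(level=0).count()
        counts.columns = chunk
        # Rows with no match in this chunk are absent; restore them as zeros.
        pieces.append(counts.reindex(labels.index, fill_value=0))
    return pd.concat(pieces, axis=1)

df = pd.DataFrame({'labels': ['created the tower',
                              'destroyed the tower',
                              'created the swimming pool',
                              'destroyed the swimming pool']})
result = count_keywords_chunked(
    df['labels'], ['created', 'destroyed', 'tower', 'swimming pool'],
    chunk_size=2)
```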
Here is an example showing that cᴏʟᴅsᴘᴇᴇᴅ's solution may be the fastest, at least for a certain range of use cases:
In [76]: %timeit using_contains(ser, keywords)
10 loops, best of 3: 63.4 ms per loop

In [77]: %timeit using_defchararray(ser, keywords)
10 loops, best of 3: 90.6 ms per loop

In [78]: %timeit using_extractall(ser, keywords)
10 loops, best of 3: 126 ms per loop
Here is the setup I used:

import string
import numpy as np
import pandas as pd

def using_defchararray(ser, keywords):
    """
    https://stackoverflow.com/a/46046558/190597 (piRSquared)
    """
    v = ser.values.astype(str)
    # The rest of the function was cut off here; reconstructed from the
    # linked answer's approach (substring search via np.char.find):
    return pd.DataFrame(
        (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int),
        index=ser.index, columns=keywords)
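The other two benchmarked functions and the test data were not shown above. Below are hedged sketches consistent with the approaches discussed in this answer (`using_contains` follows cᴏʟᴅsᴘᴇᴇᴅ's str.contains idea, `using_extractall` the pattern built earlier); the data setup is an assumption, not the original benchmark's exact input:

```python
import re
import string
import numpy as np
import pandas as pd

def using_contains(ser, keywords):
    # One str.contains pass per keyword, assembled column by column.
    return pd.concat([ser.str.contains(re.escape(k)).astype(int)
                      for k in keywords], axis=1, keys=keywords)

def using_extractall(ser, keywords):
    # The extractall/groupby approach from above.
    pat = '|'.join('({})'.format(re.escape(k)) for k in keywords)
    result = ser.str.extractall(pat).groupby(level=0).count()
    result.columns = keywords
    return result.reindex(ser.index, fill_value=0)

# One possible setup: rows of 5 random single-letter "words".
np.random.seed(2017)
keywords = list(string.ascii_lowercase)
ser = pd.Series([' '.join(np.random.choice(keywords, 5))
                 for _ in range(10000)])
```

Note the two functions are not strictly equivalent: `using_contains` reports 0/1 presence per keyword, while `using_extractall` counts every occurrence, so they can disagree on rows containing repeated keywords.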
Be sure to run the benchmarks on your own machine, with a setup similar to your use case. Results may vary with many factors, such as the size of the Series ser, the number and length of the keywords, the hardware, the OS, the versions of NumPy, Pandas, and Python, and how they were compiled.