Pandas: generate columns from a single label column

I have a csv file that looks like this:

    index,labels
    1,created the tower
    2,destroyed the tower
    3,created the swimming pool
    4,destroyed the swimming pool
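Assuming the file is saved as, say, labels.csv (the filename here is just for illustration), it can be loaded with:

    import pandas as pd

    # index_col keeps the csv's index column as the dataframe index
    df = pd.read_csv('labels.csv', index_col='index')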

Now, suppose I pass a list of the columns I want instead of the label column (the list does not contain every word that appears in the labels):

    ['created', 'tower', 'destroyed', 'swimming pool']

I want to get a dataframe like:

    index,created,destroyed,tower,swimming pool
    1,1,0,1,0
    2,0,1,1,0
    3,1,0,0,1
    4,0,1,0,1

I looked at get_dummies, but it didn't help.

+8
Tags: python, pandas
4 answers

You can call str.contains in a loop:

    print(df)

                            labels
    0            created the tower
    1          destroyed the tower
    2    created the swimming pool
    3  destroyed the swimming pool

    req = ['created', 'destroyed', 'tower', 'swimming pool']
    out = pd.concat([df['labels'].str.contains(x) for x in req],
                    axis=1, keys=req).astype(int)
    print(out)

       created  destroyed  tower  swimming pool
    0        1          0      1              0
    1        0          1      1              0
    2        1          0      0              1
    3        0          1      0              1
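Note that str.contains treats its pattern as a regular expression by default; if a keyword could contain regex metacharacters, escape it first. A minimal sketch of the same loop with escaping:

    import re

    # re.escape makes each keyword match literally, even if it contains
    # metacharacters such as '.' or '+'
    out = pd.concat([df['labels'].str.contains(re.escape(x)) for x in req],
                    axis=1, keys=req).astype(int)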
+8
    import re
    import pandas as pd

    df = pd.DataFrame({'index': [1, 2, 3, 4],
                       'labels': ['created the tower',
                                  'destroyed the tower',
                                  'created the swimming pool',
                                  'destroyed the swimming pool']})
    columns = ['created', 'destroyed', 'tower', 'swimming pool']
    pat = '|'.join(['({})'.format(re.escape(c)) for c in columns])
    result = df['labels'].str.extractall(pat).groupby(level=0).count()
    result.columns = columns
    print(result)

gives

       created  destroyed  tower  swimming pool
    0        1          0      1              0
    1        0          1      1              0
    2        1          0      0              1
    3        0          1      0              1

Most of the work is done by str.extractall:

    In [808]: df['labels'].str.extractall(r'(created)|(destroyed)|(tower)|(swimming pool)')
    Out[808]:
                     0          1      2              3
      match
    0 0        created        NaN    NaN            NaN
      1            NaN        NaN  tower            NaN
    1 0            NaN  destroyed    NaN            NaN
      1            NaN        NaN  tower            NaN
    2 0        created        NaN    NaN            NaN
      1            NaN        NaN    NaN  swimming pool
    3 0            NaN  destroyed    NaN            NaN
      1            NaN        NaN    NaN  swimming pool

Since each match is placed on its own row, the desired result can be obtained with a groupby/count operation, grouping on the first level of the index (the original row index).
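One detail to be aware of: if a keyword can occur more than once in the same label, count() will report values greater than 1 rather than a 0/1 flag. A minimal sketch that clips the counts back to an indicator, assuming that is what's wanted:

    # clip(upper=1) collapses repeated matches within one label into a flag
    result = (df['labels'].str.extractall(pat)
                .groupby(level=0).count()
                .clip(upper=1))
    result.columns = columns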


Note that the Python re module has a hard-coded limit on the number of allowed groups:

    /usr/lib/python3.4/sre_compile.py in compile(p, flags)
        577     if p.pattern.groups > 100:
        578         raise AssertionError(
    --> 579             "sorry, but this version only supports 100 named groups"
        580             )
        581

    AssertionError: sorry, but this version only supports 100 named groups

This limits the extractall approach used above to a maximum of 100 keywords.
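One way around this is to process the keywords in chunks of at most 100 and concatenate the partial results. A sketch (the helper name and chunk size here are illustrative, not from the answer above):

    def extractall_chunked(ser, keywords, chunk_size=100):
        """Apply the extractall approach in chunks to stay under the group limit."""
        pieces = []
        for i in range(0, len(keywords), chunk_size):
            chunk = list(keywords[i:i + chunk_size])
            pat = '|'.join('({})'.format(re.escape(c)) for c in chunk)
            counts = ser.str.extractall(pat).groupby(level=0).count()
            counts.columns = chunk
            pieces.append(counts)
        # rows with no matches at all are missing from the partial results,
        # so reindex against the original index and fill the gaps with zeros
        return (pd.concat(pieces, axis=1)
                  .reindex(ser.index)
                  .fillna(0)
                  .astype(int))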


Here is a benchmark suggesting that cᴏʟᴅsᴘᴇᴇᴅ's solution (at least for a certain range of use cases) may be the fastest:

    In [76]: %timeit using_contains(ser, keywords)
    10 loops, best of 3: 63.4 ms per loop

    In [77]: %timeit using_defchararray(ser, keywords)
    10 loops, best of 3: 90.6 ms per loop

    In [78]: %timeit using_extractall(ser, keywords)
    10 loops, best of 3: 126 ms per loop

Here is the setup I used:

    import re
    import string
    import numpy as np
    import pandas as pd

    def using_defchararray(ser, keywords):
        """
        https://stackoverflow.com/a/46046558/190597 (piRSquared)
        """
        v = ser.values.astype(str)
        # >>> (np.core.defchararray.find(v[:, None], keywords) >= 0)
        # array([[ True, False,  True, False],
        #        [False,  True,  True, False],
        #        [ True, False, False,  True],
        #        [False,  True, False,  True]], dtype=bool)
        result = pd.DataFrame(
            (np.core.defchararray.find(v[:, None], keywords) >= 0).astype(int),
            index=ser.index, columns=keywords)
        return result

    def using_extractall(ser, keywords):
        """
        https://stackoverflow.com/a/46046417/190597 (unutbu)
        """
        pat = '|'.join(['({})'.format(re.escape(c)) for c in keywords])
        result = ser.str.extractall(pat).groupby(level=0).count()
        result.columns = keywords
        return result

    def using_contains(ser, keywords):
        """
        https://stackoverflow.com/a/46046142/190597 (cᴏʟᴅsᴘᴇᴇᴅ)
        """
        return (pd.concat([ser.str.contains(x) for x in keywords],
                          axis=1, keys=keywords).astype(int))

    def make_random_str_array(letters=string.ascii_letters, strlen=10, size=100):
        return (np.random.choice(list(letters), size*strlen)
                .view('|U{}'.format(strlen)))

    keywords = make_random_str_array(size=99)
    arr = np.random.choice(keywords, size=(1000, 5), replace=True)
    ser = pd.Series([' '.join(row) for row in arr])

Be sure to run the benchmarks on your own machine, with a setup similar to your use case. Results may vary depending on many factors, such as the size of the Series ser, the length of keywords, hardware, OS, the versions of NumPy, Pandas and Python, and how they were compiled.

+9

Use numpy.core.defchararray.find and numpy broadcasting:

    from numpy.core.defchararray import find

    v = df['labels'].values.astype(str)
    l = ['created', 'tower', 'destroyed', 'swimming pool']

    pd.DataFrame(
        (find(v[:, None], l) >= 0).astype(int),
        df.index, l
    )

           created  tower  destroyed  swimming pool
    index
    1            1      1          0              0
    2            0      1          1              0
    3            1      0          0              1
    4            0      0          1              1

find broadcasts the str.find operation over the string arrays we provide: for each string in the first array it returns the position of the first occurrence of the corresponding string from the second array, or -1 if it is not found. Because of this, we can test whether a string was found by checking whether the return value of find is greater than or equal to 0.
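A small standalone illustration of that broadcasting, using two of the example labels:

    import numpy as np
    from numpy.core.defchararray import find

    v = np.array(['created the tower', 'destroyed the tower'])
    # v[:, None] has shape (2, 1) and the keyword list has shape (4,),
    # so the result broadcasts to shape (2, 4): one match position per
    # (string, keyword) pair, -1 where the keyword is absent
    find(v[:, None], ['created', 'tower', 'destroyed', 'swimming pool'])
    # array([[ 0, 12, -1, -1],
    #        [-1, 14,  0, -1]])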

+7

In your case, since the words in the labels are separated by the, you can simply use the following to achieve it. (PS: it is better to use cᴏʟᴅsᴘᴇᴇᴅ's answer when the separator is not always the.)

    pd.get_dummies(df['labels'].str.split('the').apply(pd.Series))

    Out[424]:
       0_created   0_destroyed   1_ swimming pool  1_ tower
    0          1             0                  0         1
    1          0             1                  0         1
    2          1             0                  1         0
    3          0             1                  1         0
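A small variation on the same idea: splitting on ' the ' (with the surrounding spaces) avoids the stray spaces in the generated column names, and str.split with expand=True replaces the apply(pd.Series) step:

    # splitting on ' the ' keeps the resulting tokens whitespace-free
    pd.get_dummies(df['labels'].str.split(' the ', expand=True))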
+4