I am learning pandas in order to do some simple calculations for NLP and text mining, but I can't quite figure out how to do one of them.
Suppose I have the following data frame linking the names of people and their gender:

    import pandas

    people = {'name': ['John Doe', 'Mary Poppins', 'Jane Doe', 'John Cusack'],
              'gender': ['M', 'F', 'F', 'M']}
    df = pandas.DataFrame(people)

For every row, I want to:
- determine the first name
- determine the list of 3-shingles (sequences of 3 letters contained in a word) obtained from the person's first name
- determine, for each shingle, how many males and how many females have that shingle in their names.
The goal is to use this as a dataset for training a classifier that can determine if a given name is probably a male or female name.
The first two operations are quite simple:

    def shingles(word, n=3):
        # Lowercase so that 'Joh' and 'joh' count as the same shingle
        word = word.lower()
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    df['firstname'] = df.name.map(lambda x: x.split()[0])
    df['shingles'] = df.firstname.map(shingles)

Result:

    > print(df)
      gender          name firstname        shingles
    0      M      John Doe      John  ['joh', 'ohn']
    1      F  Mary Poppins      Mary  ['mar', 'ary']
    2      F      Jane Doe      Jane  ['jan', 'ane']
    3      M   John Cusack      John  ['joh', 'ohn']

I think the next step is to create a new data frame with two columns, gender and shingle, which should contain something like:

      gender shingle
    0      M     joh
    1      M     ohn
    2      F     mar
    3      F     ary
    (...)

Then I could group by shingle and gender. Ideally, the result would be:

      shingle  num_males  num_females
    0     joh          2            0
    1     ohn          2            0
    2     mar          0            1
    3     ary          0            1
    (...)

Is there an easy way to expand the multi-valued shingles column so that each row produces multiple rows, one for each value found in its shingle list?
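One way this expansion could be sketched, assuming pandas 0.25 or newer (where `DataFrame.explode` is available). The name `long_df` is only illustrative, and the lowercasing inside `shingles` is an assumption made to match the lowercase shingles in the tables above:

```python
import pandas as pd

people = {'name': ['John Doe', 'Mary Poppins', 'Jane Doe', 'John Cusack'],
          'gender': ['M', 'F', 'F', 'M']}
df = pd.DataFrame(people)

def shingles(word, n=3):
    # Assumption: lowercase so shingles match the tables in the question
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

df['firstname'] = df['name'].map(lambda x: x.split()[0])
df['shingles'] = df['firstname'].map(shingles)

# explode() repeats each row once per element of its list
# (assumes pandas >= 0.25)
long_df = (df[['gender', 'shingles']]
           .explode('shingles')
           .rename(columns={'shingles': 'shingle'})
           .reset_index(drop=True))
print(long_df)
```

Note that a first name shorter than three letters yields an empty list, which `explode` turns into a row with a missing shingle; such rows may need to be dropped with `dropna` before counting.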
Also, if I group by a column, is there an easy way to create separate columns holding the count for each possible value of the gender column?
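For the counting step, `pandas.crosstab` is one option that seems to fit. The sketch below runs on a small hand-written long-format frame (the name `long_df` is hypothetical), and the `num_males` / `num_females` column names simply follow the desired table above:

```python
import pandas as pd

# Hand-written long-format data: one row per (gender, shingle) pair,
# corresponding to the first names John, Mary, Jane, and John.
long_df = pd.DataFrame({
    'gender':  ['M', 'M', 'F', 'F', 'F', 'F', 'M', 'M'],
    'shingle': ['joh', 'ohn', 'mar', 'ary', 'jan', 'ane', 'joh', 'ohn'],
})

# crosstab counts co-occurrences: one row per shingle, one column per gender
counts = (pd.crosstab(long_df['shingle'], long_df['gender'])
          .rename(columns={'M': 'num_males', 'F': 'num_females'})
          .reset_index())
counts.columns.name = None  # drop the leftover 'gender' axis label
print(counts)
```

Shingles come out sorted alphabetically rather than in order of first appearance, and a gender that never occurs in the data gets no column at all, so a classifier pipeline may want to reindex the columns explicitly.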
I managed to figure out the second part. For example, to count how many males and females there are for each first name:

    def countMaleFemale(df):
        return pandas.Series({'males': df.gender[df.gender == 'M'].count(),
                              'females': df.gender[df.gender == 'F'].count()})

    grouped = df.groupby('firstname')

And then:

    print(grouped.apply(countMaleFemale))

               females  males
    firstname
    Jane             1      0
    John             0      2
    Mary             1      0

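The same per-name counts can probably be obtained without a custom function. A minimal sketch, assuming a frame with `firstname` and `gender` columns like the one above, using `groupby(...).size()` followed by `unstack`:

```python
import pandas as pd

df = pd.DataFrame({
    'firstname': ['John', 'Mary', 'Jane', 'John'],
    'gender': ['M', 'F', 'F', 'M'],
})

# size() counts rows per (firstname, gender) pair; unstack() moves the
# gender level into columns, filling missing combinations with 0.
per_name = df.groupby(['firstname', 'gender']).size().unstack(fill_value=0)
print(per_name)
```

Here the columns are named after the raw gender values (`F`, `M`) rather than `females` and `males`; a `rename(columns=...)` call could map them if the longer names are preferred.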