Questions about pandas: expanding a multi-valued column, inverting and grouping

I am learning pandas to do some simple calculations for NLP and text mining, but I couldn't figure out how to do the following.

Suppose I have the following data frame linking the names of people and their gender:

    import pandas

    people = {'name': ['John Doe', 'Mary Poppins', 'Jane Doe', 'John Cusack'],
              'gender': ['M', 'F', 'F', 'M']}
    df = pandas.DataFrame(people)

For each row I want to:

  • extract the first name
  • build the list of 3-shingles (sequences of 3 letters contained in a word) from the person's first name
  • determine, for each shingle, how many men and how many women have that shingle in their names.

The goal is to use this as a dataset for training a classifier that can determine if a given name is probably a male or female name.

The first two operations are quite simple:

    def shingles(word, n = 3):
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    df['firstname'] = df.name.map(lambda x : x.split()[0])
    df['shingles'] = df.firstname.map(shingles)

result:

    > print(df)
      gender          name firstname        shingles
    0      M      John Doe      John  ['joh', 'ohn']
    1      F  Mary Poppins      Mary  ['mar', 'ary']
    2      F      Jane Doe      Jane  ['jan', 'ane']
    3      M   John Cusack      John  ['joh', 'ohn']

The next step would be to create a new data frame with two columns, gender and shingle, containing something like:

      gender shingle
    0      M     joh
    1      M     ohn
    2      F     mar
    3      F     ary
    (...)

Then I could group by shingle and gender. Ideally, the result would be:

      shingle  num_males  num_females
    0     joh          2            0
    1     ohn          2            0
    2     mar          0            1
    3     ary          0            1
    (...)

Is there an easy way to expand the multi-valued shingles column so that each row produces multiple rows, one for each value found in the shingle list?

Also, once I group by a column, is there an easy way to create separate columns that count each possible value of the gender column?


I managed to figure out the second part. As an example, to count how many men and women there are for each firstname:

    def countMaleFemale(df):
        return pandas.Series({'males': df.gender[df.gender == 'M'].count(),
                              'females': df.gender[df.gender == 'F'].count()})

    grouped = df.groupby('firstname')

And then:

    print(grouped.apply(countMaleFemale))

               females  males
    firstname
    Jane             1      0
    John             0      2
    Mary             1      0
2 answers

This method should generalize fairly well:

    In [100]: df
    Out[100]:
      gender          name firstname    shingles
    0      M      John Doe      John  [Joh, ohn]
    1      F  Mary Poppins      Mary  [Mar, ary]
    2      F      Jane Doe      Jane  [Jan, ane]
    3      M   John Cusack      John  [Joh, ohn]

First, create an "expanded" series where each entry is a single shingle. Here, the index of the series is a multi-index, where the first level represents the shingle position and the second level represents the index of the original DF:

    In [103]: s = df.shingles.apply(lambda x: pandas.Series(x)).unstack(); s
    Out[103]:
    0  0    Joh
       1    Mar
       2    Jan
       3    Joh
    1  0    ohn
       1    ary
       2    ane
       3    ohn

Then we can join the created series onto the original data frame. You have to reset the index, dropping the shingle-position level. The resulting series has the original index and one entry for each shingle. Joining it onto the original data frame produces:

    In [106]: df2 = df.join(pandas.DataFrame(s.reset_index(level=0, drop=True))); df2
    Out[106]:
      gender          name firstname    shingles    0
    0      M      John Doe      John  [Joh, ohn]  Joh
    0      M      John Doe      John  [Joh, ohn]  ohn
    1      F  Mary Poppins      Mary  [Mar, ary]  Mar
    1      F  Mary Poppins      Mary  [Mar, ary]  ary
    2      F      Jane Doe      Jane  [Jan, ane]  Jan
    2      F      Jane Doe      Jane  [Jan, ane]  ane
    3      M   John Cusack      John  [Joh, ohn]  Joh
    3      M   John Cusack      John  [Joh, ohn]  ohn

Finally, you can group by the new shingle column (named 0 here), count the values of the gender column in each group, unstack the returned series, and fill the NaNs with zeros:

    In [124]: df2.groupby(0, sort=False)['gender'].value_counts().unstack().fillna(0)
    Out[124]:
         F  M
    0
    Joh  0  2
    ohn  0  2
    Mar  1  0
    ary  1  0
    Jan  1  0
    ane  1  0
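
As a hedged aside, not part of the steps above: pandas 0.25 and later provide DataFrame.explode, which performs the "one row per list element" expansion directly, and pandas.crosstab can do the counting. The sketch below assumes the same toy data as the question; the names long_df and counts are only illustrative.

```python
import pandas

# Same toy data as in the question.
people = {'name': ['John Doe', 'Mary Poppins', 'Jane Doe', 'John Cusack'],
          'gender': ['M', 'F', 'F', 'M']}
df = pandas.DataFrame(people)

def shingles(word, n=3):
    # All contiguous substrings of length n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

df['firstname'] = df.name.map(lambda x: x.split()[0])
df['shingles'] = df.firstname.map(shingles)

# One row per (person, shingle); the original index values are repeated.
long_df = df.explode('shingles').rename(columns={'shingles': 'shingle'})

# Count genders per shingle; combinations that never occur show up as 0.
counts = pandas.crosstab(long_df['shingle'], long_df['gender'])
print(counts)
```

Note that explode turns an empty list into a single NaN row, so names shorter than three letters would need to be dropped or handled separately.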

It might be easier to build the expanded version while creating the shingles. This question shows how you can use groupby for this kind of expansion. Here is an example of what you could do after creating the firstname column:

    def shingles(table, n = 3):
        # Each group holds one person; take that person's first name.
        word = table['firstname'].iloc[0]
        shingles = [word[i:i + n] for i in range(len(word) - n + 1)]
        # Repeat the person's scalar values alongside the list of shingles.
        cols = {col: table[col].iloc[0] for col in table.columns}
        cols['shingle'] = shingles
        return pandas.DataFrame(cols)

    >>> df.groupby('name', group_keys=False).apply(shingles)
      firstname gender          name shingle
    0      Jane      F      Jane Doe     Jan
    1      Jane      F      Jane Doe     ane
    0      John      M   John Cusack     Joh
    1      John      M   John Cusack     ohn
    0      John      M      John Doe     Joh
    1      John      M      John Doe     ohn
    0      Mary      F  Mary Poppins     Mar
    1      Mary      F  Mary Poppins     ary

(I group by name here rather than by firstname, in case there are duplicate first names, but this assumes that the full name is unique.)

From there, you can group and count whatever you want.
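
As a rough sketch of that last step, assuming a long-format frame with one row per (person, shingle) like the output above: grouping on both columns, taking the group sizes, and unstacking the gender level gives one count column per gender. The frame long_df is typed in by hand here for illustration, and the num_males / num_females names are borrowed from the question's desired output.

```python
import pandas

# Hand-written long-format frame: one row per (person, shingle).
long_df = pandas.DataFrame({
    'gender':  ['F', 'F', 'M', 'M', 'M', 'M', 'F', 'F'],
    'shingle': ['Jan', 'ane', 'Joh', 'ohn', 'Joh', 'ohn', 'Mar', 'ary'],
})

counts = (
    long_df.groupby(['shingle', 'gender'])
           .size()                    # rows per (shingle, gender) pair
           .unstack(fill_value=0)     # gender values become columns
           .rename(columns={'M': 'num_males', 'F': 'num_females'})
)
print(counts)
```

Calling counts.reset_index() afterwards would turn shingle back into an ordinary column, which is closer to the layout asked for in the question.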
