How do I create new rows in a pandas DataFrame, one per word in a string column of an existing row?

I have a pandas DataFrame with a column, df.Strings, that contains lines of text. I would like each individual word of these strings to get a row of its own, with the same values in the other columns. For example, if I have 3 rows (and an unrelated "Time" column):

        Strings Time
    0   The dog  4Pm
    1  lazy dog  2Pm
    2   The fox  1Pm

I want the new rows to contain the words from each string, with the other columns otherwise identical:

    Strings    --- Words  --- Time
    "The dog"  --- "The"  --- 4Pm
    "The dog"  --- "dog"  --- 4Pm
    "lazy dog" --- "lazy" --- 2Pm
    "lazy dog" --- "dog"  --- 2Pm
    "The fox"  --- "The"  --- 1Pm
    "The fox"  --- "fox"  --- 1Pm

I know how to split the strings into words:

    import re

    # join all rows into one newline-separated string, then pull out the words
    string_list = '\n'.join(df.Strings.map(str))
    word_list = re.findall('[a-z]+', string_list)
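Joining everything into one string loses track of which row each word came from. A minimal sketch of a per-row variant, assuming a pandas version that provides `Series.str.findall` (this may not hold for 0.10.1) and a pattern widened to `[A-Za-z]+` so that "The" is kept whole (both are assumptions, not part of the snippet above):

```python
import pandas as pd

# Example frame mirroring the one in the question
df = pd.DataFrame({
    "Strings": ["The dog", "lazy dog", "The fox"],
    "Time": ["4Pm", "2Pm", "1Pm"],
})

# One list of words per row; the result keeps the same index as df
words_per_row = df["Strings"].str.findall(r"[A-Za-z]+")
print(words_per_row)
```

This keeps the words grouped by row, but each row still holds a list rather than one word per row.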

But how can I get them in a dataframe while keeping the index and other variables? I am using Python 2.7 and pandas 0.10.1.

EDIT: I now understand how to expand the strings with groupby, as found in this question:

    def f(group):
        row = group.irow(0)
        return DataFrame({'words': re.findall('[a-z]+', row['Strings'])})

    df.groupby('class', group_keys=False).apply(f)

I would like to keep the other columns. Is it possible?

python pandas
1 answer

Here is my code, which does not use groupby(); I think it is faster.

    import itertools

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "strings": ["the dog", "lazy dog", "The fox jump"],
        "value": ["a", "b", "c"]})

    w = df.strings.str.split()          # list of words for each row
    c = w.map(len)                      # number of words in each row
    idx = np.repeat(c.index, c.values)  # repeat each index label once per word
    # words = np.concatenate(w.values)
    words = list(itertools.chain.from_iterable(w.values))  # flatten the lists
    s = pd.Series(words, index=idx)
    s.name = "words"
    print(df.join(s))  # joining on the index repeats the other columns

Result:

            strings value words
    0       the dog     a   the
    0       the dog     a   dog
    1      lazy dog     b  lazy
    1      lazy dog     b   dog
    2  The fox jump     c   The
    2  The fox jump     c   fox
    2  The fox jump     c  jump
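On newer pandas versions (0.25 and later, so not the 0.10.1 from the question), a similar result can probably be had more directly with `DataFrame.explode`. A minimal sketch, assuming such a version is installed; it swaps the manual index repetition for `explode` and is an alternative, not the method shown above:

```python
import pandas as pd

df = pd.DataFrame({
    "strings": ["the dog", "lazy dog", "The fox jump"],
    "value": ["a", "b", "c"],
})

# Split each string into a list of words, then give every word its own row.
# explode() repeats the other columns and the original index label per word.
result = df.assign(words=df["strings"].str.split()).explode("words")
print(result)
```

The repeated index labels (0, 0, 1, 1, 2, 2, 2) match the join-based output; calling `reset_index(drop=True)` afterwards would give a fresh 0..n-1 index if that is preferred.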