How to calculate the number of words per line in a DataFrame?

Question

Suppose we have a simple DataFrame:

    import pandas as pd

    df = pd.DataFrame(['one apple', 'banana', 'box of oranges',
                       'pile of fruits outside', 'one banana', 'fruits'])
    df.columns = ['fruits']

How can I count the number of words in each keyword and get a tally similar to:

    1 word: 2
    2 words: 2
    3 words: 1
    4 words: 1

python pandas dataframe

Sergei May 27 '16 at 12:21

2 answers

You can use str.count to count the spaces ' ' in each string and add 1 to get the number of words.

    In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

    In [1717]: count.index = count.index.astype('str') + ' words:'

    In [1718]: count
    Out[1718]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64
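One caveat with counting spaces: it assumes words are separated by exactly one space, with no leading or trailing whitespace. The sketch below illustrates the difference on a hypothetical `messy` Series that is not part of the question's data. Counting runs of non-whitespace with the regex `\S+` is one possible alternative, not something this answer itself uses.

```python
import pandas as pd

# Hypothetical strings with irregular whitespace (not from the question).
messy = pd.Series(['one  apple', ' banana ', 'box of oranges'])

# Counting single spaces and adding 1 overcounts when spacing is irregular.
by_spaces = messy.str.count(' ').add(1)

# Counting runs of non-whitespace characters counts the words themselves.
by_tokens = messy.str.count(r'\S+')

print(by_spaces.tolist())
print(by_tokens.tolist())
```

On clean input such as the question's `df`, both expressions give the same counts, so the simpler space count is fine there.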

Timings

str.count is a little faster than str.split followed by apply(len):

Small

    In [1724]: df.shape
    Out[1724]: (6, 1)

    In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    1000 loops, best of 3: 649 µs per loop

    In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
    1000 loops, best of 3: 840 µs per loop

Medium

    In [1728]: df.shape
    Out[1728]: (6000, 1)

    In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    100 loops, best of 3: 6.58 ms per loop

    In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
    100 loops, best of 3: 6.99 ms per loop

Large

    In [1732]: df.shape
    Out[1732]: (60000, 1)

    In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
    1 loop, best of 3: 57.6 ms per loop

    In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
    1 loop, best of 3: 73.8 ms per loop
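To reproduce a comparison like this outside IPython, something along these lines should work with the standard `timeit` module. How the larger frames were built is not stated in the answer, so repeating the six rows 1000 times is an assumption, and the helper names `by_count` and `by_split` are made up for this sketch. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = ['one apple', 'banana', 'box of oranges',
        'pile of fruits outside', 'one banana', 'fruits']

# Assumption: a 6000-row frame built by repeating the six rows 1000 times.
df = pd.DataFrame({'fruits': base * 1000})


def by_count():
    # Count spaces and add 1 to get the words per row, then tally.
    return df['fruits'].str.count(' ').add(1).value_counts(sort=False)


def by_split():
    # Split on whitespace, take the list length per row, then tally.
    return df['fruits'].str.split().apply(len).value_counts()


# Both approaches should agree on the frequencies before comparing speed.
assert by_count().sort_index().tolist() == by_split().sort_index().tolist()

for fn in (by_count, by_split):
    # Best of 3 repeats of 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```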

Zero Oct 14 '17 at 5:32

Edchum · Accepted Answer · 2016-05-27T12:24:59+0000

IIUC, you can do the following:

    In [89]:
    count = df['fruits'].str.split().apply(len).value_counts()
    count.index = count.index.astype(str) + ' words:'
    count.sort_index(inplace=True)
    count

    Out[89]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64

Here we use the vectorised str.split to split each string on whitespace, then apply len to count the elements in each resulting list; value_counts then tallies how often each word count occurs.

Then we rename the index and sort it to get the desired result.
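The same pipeline can be broken into its intermediate steps to see what each call produces. This is only an illustrative sketch; the names `words`, `n_words` and `freq` are introduced here and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace, giving one list of words per row.
words = df['fruits'].str.split()

# Step 2: the length of each list is the number of words in that row.
n_words = words.apply(len)

# Step 3: tally how many rows have each word count, ordered by word count.
freq = n_words.value_counts().sort_index()

print(n_words.tolist())   # words per row
print(freq.to_dict())     # word count -> number of rows
```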

UPDATE

This can also be done using str.len rather than apply, which should scale better:

    In [41]:
    count = df['fruits'].str.split().str.len().value_counts()
    count.index = count.index.astype(str) + ' words:'
    count.sort_index(inplace=True)
    count

    Out[41]:
    1 words:    2
    2 words:    2
    3 words:    1
    4 words:    1
    Name: fruits, dtype: int64
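Apart from speed, the two variants also behave differently when the column contains missing values. A small sketch on a hypothetical Series `s` with a NaN entry (the question's data has none):

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing entry (not part of the question's data).
s = pd.Series(['one apple', np.nan, 'box of oranges'])

# .str.len() propagates the missing value as NaN instead of raising.
lengths = s.str.split().str.len()

# apply(len) calls len() on the missing value itself, which fails.
try:
    s.str.split().apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

# value_counts drops NaN by default, so only the real rows are tallied.
freq = lengths.value_counts().sort_index()
print(lengths.tolist())
print(freq.tolist())
```

If the column may contain missing values, the str.len variant is therefore the more forgiving choice.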

Timings

Note that the second statement in each cell times only the length computation, without value_counts.

    In [42]:
    %timeit df['fruits'].str.split().apply(len).value_counts()
    %timeit df['fruits'].str.split().str.len()

    1000 loops, best of 3: 799 µs per loop
    1000 loops, best of 3: 347 µs per loop

For a 6K-row df:

    In [51]:
    %timeit df['fruits'].str.split().apply(len).value_counts()
    %timeit df['fruits'].str.split().str.len()

    100 loops, best of 3: 6.3 ms per loop
    100 loops, best of 3: 6 ms per loop
