Pythonic way to calculate the length of lists in a pandas DataFrame column

I have a dataframe like this:

                                                        CreationDate
    2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]
    2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]
    2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]

I compute the length of the lists in the CreationDate column and create a new Length column as follows:

 df['Length'] = df.CreationDate.apply(lambda x: len(x)) 

Which gives me this:

                                                        CreationDate  Length
    2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
    2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
    2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4

Is there a more pythonic way to do this?

python pandas
2 answers

You can use the str accessor for some list operations as well. In this example,

 df['CreationDate'].str.len() 

returns the length of each list. See the docs for str.len.

    df['Length'] = df['CreationDate'].str.len()
    df
    Out:
                                                        CreationDate  Length
    2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
    2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
    2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
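For anyone who wants to try this without the original data, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the three rows shown in the question, and it assumes the timestamps are the index (an interpretation of the printed output, not something the question states).

```python
import pandas as pd

# Rebuild the example frame from the question (index assumed to be the timestamps).
df = pd.DataFrame(
    {
        "CreationDate": [
            ["ubuntu", "mac-osx", "syslinux"],
            ["ubuntu", "mod-rewrite", "laconica", "apache-2.2"],
            ["ubuntu", "nat", "squid", "mikrotik"],
        ]
    },
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
)

# The str accessor also works on object columns holding lists.
df["Length"] = df["CreationDate"].str.len()

lengths = df["Length"].tolist()
print(lengths)  # [3, 4, 4]
```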

For operations like this, vanilla Python is generally faster. The advantage of the pandas methods is that they handle NaNs. Here are some timings:

    import random
    import string
    import pandas as pd

    ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])

    %timeit ser.apply(lambda x: len(x))
    1 loop, best of 3: 425 ms per loop

    %timeit ser.str.len()
    1 loop, best of 3: 248 ms per loop

    %timeit [len(x) for x in ser]
    10 loops, best of 3: 84 ms per loop

    %timeit pd.Series([len(x) for x in ser], index=ser.index)
    1 loop, best of 3: 236 ms per loop
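To illustrate the NaN remark, here is a small sketch on a made-up Series (the values are hypothetical, not from the question). It shows how a missing entry passes through str.len() as NaN, whereas calling len on it directly raises a TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two lists and one missing value.
ser = pd.Series([["a", "b"], np.nan, ["c"]])

# str.len() propagates the missing value instead of failing;
# the result is a float Series because of the NaN.
str_lengths = ser.str.len()
print(str_lengths)

# apply(len) calls len() on the float NaN, which raises TypeError.
try:
    ser.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

print(apply_failed)  # True
```

If the column is known to contain no missing values, the plain list comprehension from the timings is a reasonable choice. Otherwise str.len() saves an explicit check.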

Here is another option, using apply with a lambda function:

 df['Length'] = df["CreationDate"].apply(lambda l: len(l)) 
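As a side note, the lambda wrapper is not strictly needed, since the built-in len can be passed to apply directly. A minimal sketch on a hand-built frame (the sample rows are copied from the question, with a default integer index for brevity):

```python
import pandas as pd

# Small stand-in for the question's frame.
df = pd.DataFrame(
    {
        "CreationDate": [
            ["ubuntu", "mac-osx", "syslinux"],
            ["ubuntu", "nat", "squid", "mikrotik"],
        ]
    }
)

# Passing len itself is equivalent to lambda l: len(l).
df["Length"] = df["CreationDate"].apply(len)

direct_lengths = df["Length"].tolist()
print(direct_lengths)  # [3, 4]
```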
