Get a subset of the most common dummy variables in pandas

I'm trying to do some linear regression analysis. I have some categorical features that I convert into dummy variables using the super awesome get_dummies.

The problem I am facing is that the data frame becomes too large when I add dummies for every category level.

Is there a way (using get_dummies or something more involved) to create dummy variables only for the most common values instead of all of them?
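To make the issue concrete, here is a minimal sketch of the blow-up (the `city` column and its values are made up for illustration): get_dummies creates one column per distinct value, so a high-cardinality column produces a very wide frame.

    import pandas as pd

    # hypothetical column: a few common values plus many rare ones
    city = pd.Series(["london", "paris", "tokyo"] * 5 + ["town_%d" % i for i in range(100)],
                     name="city")

    dummies = pd.get_dummies(city)
    print(dummies.shape)  # (115, 103) -- one column per distinct value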

python pandas
3 answers

Use value_counts() to count the frequencies, then build a mask for the rows you want to keep:

    import pandas as pd

    values = pd.Series(["a", "b", "a", "b", "c", "d", "e", "a"])
    counts = pd.value_counts(values)
    mask = values.isin(counts[counts > 1].index)
    print(pd.get_dummies(values[mask]))

Output:

       a  b
    0  1  0
    1  0  1
    2  1  0
    3  0  1
    7  1  0

If you want to keep all of the rows:

    values[~mask] = "-"
    print(pd.get_dummies(values))

Output:

       -  a  b
    0  0  1  0
    1  0  0  1
    2  0  1  0
    3  0  0  1
    4  1  0  0
    5  1  0  0
    6  1  0  0
    7  0  1  0
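If you would rather cap the number of dummy columns directly instead of using a count cutoff, the same idea works with the top N most frequent values. A sketch, with N = 2, starting from the original values and counts of the first snippet:

    # keep only the N most frequent values, map the rest to a placeholder
    N = 2
    top_n = counts.index[:N]                     # value_counts() is sorted by frequency
    trimmed = values.where(values.isin(top_n), "-")
    print(pd.get_dummies(trimmed))               # same "-", "a", "b" columns as above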

You can use value_counts first to find out which ones are most common:

    In [11]: s = pd.Series(list('aabccc'))

    In [12]: s
    Out[12]:
    0    a
    1    a
    2    b
    3    c
    4    c
    5    c
    dtype: object

    In [13]: s.value_counts()
    Out[13]:
    c    3
    a    2
    b    1
    dtype: int64

The least frequent values (for example, everything except the top two) are:

    In [14]: s.value_counts().index[2:]
    Out[14]: Index([u'b'], dtype=object)

You can simply replace all of these occurrences with NaN:

    In [15]: s1 = s.replace(s.value_counts().index[2:], np.nan)

    In [16]: s1
    Out[16]:
    0      a
    1      a
    2    NaN
    3      c
    4      c
    5      c
    dtype: object

and call get_dummies (which I think ought to ignore NaN, but there is a bug, hence the notnull hack):

    In [16]: pd.get_dummies(s1[s1.notnull()])
    Out[16]:
       a  c
    0  1  0
    1  1  0
    3  0  1
    4  0  1
    5  0  1

If you want those rows included in the result, you can use a different placeholder instead (e.g. '_').
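For instance (a sketch of that idea, not part of the original answer), filling the NaNs with '_' before get_dummies keeps those rows in the result under the placeholder column:

    # keep the rare/removed rows by giving them a placeholder category
    pd.get_dummies(s1.fillna('_'))
    #    _  a  c
    # 0  0  1  0
    # 1  0  1  0
    # 2  1  0  0
    # 3  0  0  1
    # 4  0  0  1
    # 5  0  0  1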


I used @HYRY's answer to write a function with a threshold parameter that separates the popular values from the unpopular ones (the unpopular ones are combined into an "others" column).

    import pandas as pd
    import numpy as np

    # returns a dummified DataFrame of the significant dummies in a given column
    def dum_sign(dummy_col, threshold=0.1):
        # work on a copy so the original column is not modified
        dummy_col = dummy_col.copy()

        # ratio of each value in the whole column
        count = pd.value_counts(dummy_col) / len(dummy_col)

        # mask of the values whose ratio is above the threshold
        mask = dummy_col.isin(count[count > threshold].index)

        # replace the values whose ratio is below the threshold with a special name
        dummy_col[~mask] = "others"

        return pd.get_dummies(dummy_col, prefix=dummy_col.name)

Let me create some data:

    df = ['a', 'a', np.nan, np.nan, 'a', np.nan, 'a', 'b', 'b', 'b', 'b', 'b',
          'c', 'c', 'd', 'e', 'g', 'g', 'g', 'g']
    data = pd.Series(df, name='dums')

Usage examples:

    In: dum_sign(data)
    Out:
        dums_a  dums_b  dums_g  dums_others
    0        1       0       0            0
    1        1       0       0            0
    2        0       0       0            1
    3        0       0       0            1
    4        1       0       0            0
    5        0       0       0            1
    6        1       0       0            0
    7        0       1       0            0
    8        0       1       0            0
    9        0       1       0            0
    10       0       1       0            0
    11       0       1       0            0
    12       0       0       0            1
    13       0       0       0            1
    14       0       0       0            1
    15       0       0       0            1
    16       0       0       1            0
    17       0       0       1            0
    18       0       0       1            0
    19       0       0       1            0

    In: dum_sign(data, threshold=0.2)
    Out:
        dums_b  dums_others
    0        0            1
    1        0            1
    2        0            1
    3        0            1
    4        0            1
    5        0            1
    6        0            1
    7        1            0
    8        1            0
    9        1            0
    10       1            0
    11       1            0
    12       0            1
    13       0            1
    14       0            1
    15       0            1
    16       0            1
    17       0            1
    18       0            1
    19       0            1

    In: dum_sign(data, threshold=0)
    Out:
        dums_a  dums_b  dums_c  dums_d  dums_e  dums_g  dums_others
    0        1       0       0       0       0       0            0
    1        1       0       0       0       0       0            0
    2        0       0       0       0       0       0            1
    3        0       0       0       0       0       0            1
    4        1       0       0       0       0       0            0
    5        0       0       0       0       0       0            1
    6        1       0       0       0       0       0            0
    7        0       1       0       0       0       0            0
    8        0       1       0       0       0       0            0
    9        0       1       0       0       0       0            0
    10       0       1       0       0       0       0            0
    11       0       1       0       0       0       0            0
    12       0       0       1       0       0       0            0
    13       0       0       1       0       0       0            0
    14       0       0       0       1       0       0            0
    15       0       0       0       0       1       0            0
    16       0       0       0       0       0       1            0
    17       0       0       0       0       0       1            0
    18       0       0       0       0       0       1            0
    19       0       0       0       0       0       1            0

Any suggestions on handling NaNs? I don't think NaNs should be treated as "others".
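One option (a sketch, not part of the function above) is to give missing values their own label before calling dum_sign, so they are never folded into "others":

    # give NaN its own category; with 3 NaNs out of 20 rows the ratio is 0.15,
    # so at threshold=0.1 it gets a dums_missing column instead of landing in "others"
    data_filled = data.fillna("missing")
    print(dum_sign(data_filled, threshold=0.1))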

UPD: I tested this on a fairly large dataset (5 million rows) with 183 distinct values in the column I wanted to dummify; it takes about 10 seconds on my laptop.
