I used the answer that @HYRY gave to write a function with a parameter (threshold) that separates popular values from unpopular ones (the unpopular ones are combined into an "others" column).
import pandas as pd
import numpy as np
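The body of dum_sign is not shown in this post, so here is a minimal sketch that reproduces the outputs listed further down. It assumes the default threshold is 0.1, that a value counts as popular when its share of all rows (NaN rows included in the denominator) is strictly greater than the threshold, and that dtype=int is passed so recent pandas versions print 0/1 rather than True/False. Treat it as a reconstruction, not necessarily the exact original code.

```python
import pandas as pd


def dum_sign(dummy_col, threshold=0.1):
    """One-hot encode a Series, lumping rare values into an 'others' column.

    Assumption: a value is "popular" when its share of all rows
    (NaN rows included in the denominator) is strictly above `threshold`.
    """
    # Share of each distinct value; value_counts() skips NaN by default.
    ratios = dummy_col.value_counts() / len(dummy_col)
    popular = ratios[ratios > threshold].index

    # True for rows holding a popular value; NaN rows come out False here,
    # which is why they end up in "others".
    mask = dummy_col.isin(popular)

    # Keep popular values, replace everything else with the label "others".
    grouped = dummy_col.where(mask, "others")

    # dtype=int is an assumption so the result shows 0/1 instead of booleans.
    return pd.get_dummies(grouped, prefix=dummy_col.name, dtype=int)
```

Because NaN never appears in the index returned by value_counts(), missing values always fall on the "others" side of the mask in this version.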
Let me create some data:
df = ['a', 'a', np.nan, np.nan, 'a', np.nan, 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'c', 'd', 'e', 'g', 'g', 'g', 'g']
data = pd.Series(df, name='dums')
Usage examples:
In: dum_sign(data)
Out:
    dums_a  dums_b  dums_g  dums_others
0        1       0       0            0
1        1       0       0            0
2        0       0       0            1
3        0       0       0            1
4        1       0       0            0
5        0       0       0            1
6        1       0       0            0
7        0       1       0            0
8        0       1       0            0
9        0       1       0            0
10       0       1       0            0
11       0       1       0            0
12       0       0       0            1
13       0       0       0            1
14       0       0       0            1
15       0       0       0            1
16       0       0       1            0
17       0       0       1            0
18       0       0       1            0
19       0       0       1            0

In: dum_sign(data, threshold=0.2)
Out:
    dums_b  dums_others
0        0            1
1        0            1
2        0            1
3        0            1
4        0            1
5        0            1
6        0            1
7        1            0
8        1            0
9        1            0
10       1            0
11       1            0
12       0            1
13       0            1
14       0            1
15       0            1
16       0            1
17       0            1
18       0            1
19       0            1

In: dum_sign(data, threshold=0)
Out:
    dums_a  dums_b  dums_c  dums_d  dums_e  dums_g  dums_others
0        1       0       0       0       0       0            0
1        1       0       0       0       0       0            0
2        0       0       0       0       0       0            1
3        0       0       0       0       0       0            1
4        1       0       0       0       0       0            0
5        0       0       0       0       0       0            1
6        1       0       0       0       0       0            0
7        0       1       0       0       0       0            0
8        0       1       0       0       0       0            0
9        0       1       0       0       0       0            0
10       0       1       0       0       0       0            0
11       0       1       0       0       0       0            0
12       0       0       1       0       0       0            0
13       0       0       1       0       0       0            0
14       0       0       0       1       0       0            0
15       0       0       0       0       1       0            0
16       0       0       0       0       0       1            0
17       0       0       0       0       0       1            0
18       0       0       0       0       0       1            0
19       0       0       0       0       0       1            0
Any suggestions on how to handle NaNs? I believe NaNs should not be counted as "others".
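One possible way to keep NaNs out of "others" is to relabel only the non-missing rare values and leave missing entries untouched. The sketch below is just one option, not a settled answer: the name dum_sign_keep_nan and the nan_column flag are hypothetical, and it still divides by the full column length when computing shares, which is an assumption you may want to change. It relies on the dummy_na argument of pd.get_dummies to optionally add a separate indicator column for missing values.

```python
import pandas as pd


def dum_sign_keep_nan(dummy_col, threshold=0.1, nan_column=False):
    """Variant sketch: rare values go to 'others', NaN rows do not.

    `dum_sign_keep_nan` and `nan_column` are hypothetical names
    introduced for illustration only.
    """
    # Share of each distinct non-missing value among all rows.
    ratios = dummy_col.value_counts() / len(dummy_col)
    popular = ratios[ratios > threshold].index

    # Rare means: not missing AND not among the popular values.
    rare = dummy_col.notna() & ~dummy_col.isin(popular)

    # Replace only the rare rows; NaN entries stay NaN.
    grouped = dummy_col.mask(rare, "others")

    # dummy_na=False: NaN rows become all-zero rows.
    # dummy_na=True: NaN rows get their own indicator column (placed last).
    return pd.get_dummies(grouped, prefix=dummy_col.name,
                          dummy_na=nan_column, dtype=int)
```

With nan_column=False the missing rows are simply all zeros, so they can still be told apart from "others"; with nan_column=True every row has exactly one 1.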
UPD: I tested it on a fairly large dataset (5 million rows) with 183 distinct values in the column that I wanted to encode. The implementation takes 10 seconds on my laptop.
Vladimir Yashin