How to speed up LabelEncoder to recode a categorical variable into integers

I have a big csv with two strings per row, in this form:

 g,k
 a,h
 c,i
 j,e
 d,i
 i,h
 b,b
 d,d
 i,a
 d,h

I read in the first two columns and transcode the rows into integers as follows:

 import pandas as pd
 from sklearn.preprocessing import LabelEncoder

 df = pd.read_csv("test.csv", usecols=[0, 1], prefix="ID_", header=None)

 # Initialize the LabelEncoder and fit it on all values from both columns.
 le = LabelEncoder()
 le.fit(df.values.flat)

 # Convert to integers.
 df = df.apply(le.transform)

This code is from https://stackoverflow.com .

The code works well, but it is slow when df is big. I timed each step, and the result was unexpected for me:

  • pd.read_csv takes about 40 seconds.
  • le.fit(df.values.flat) takes about 30 seconds.
  • df = df.apply(le.transform) takes about 250 seconds.

Is there a way to speed up this last step? Surely it should be the fastest of the three!


Additional timings for the transcoding stage on a computer with 4 GB of RAM

The answer below from maxymoo is fast, but it does not give the correct answer. Taking the example csv from the top of the question, it translates to:

    0  1
 0  4  6
 1  0  4
 2  2  5
 3  6  3
 4  3  5
 5  5  4
 6  1  1
 7  3  2
 8  5  0
 9  3  4

Note that 'd' maps to 3 in the first column, but 2 in the second.
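For reference, this column-by-column encoding can be reproduced in a few lines (the column names `ID_0`/`ID_1` are assumed from the question's `prefix="ID_"`):

```python
import pandas as pd

# The sample csv from the top of the question, split into its two columns.
df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})

# Encoding each column independently gives each column its own alphabet:
# 'd' is the 4th sorted unique in ID_0 (code 3) but only the 3rd in ID_1
# (code 2), so row 7, which is ('d', 'd'), encodes to (3, 2).
per_col = df.apply(lambda s: s.astype('category').cat.codes)
```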

I tried the solution from https://stackoverflow.com/a/3/920947/ and got the following.

 df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 1000000),
                    'ID_1': np.random.randint(0, 1000, 1000000)}).astype(str)
 df.info()
 # memory usage: 7.6MB

 %timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
 1 loops, best of 3: 1.7 s per loop

Then I increased the data size by 10 times.

 df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 10000000),
                    'ID_1': np.random.randint(0, 1000, 10000000)}).astype(str)
 df.info()
 # memory usage: 76.3+ MB

 %timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
 MemoryError                  Traceback (most recent call last)

This method uses so much RAM trying to translate this relatively small dataframe that it crashes.

I also timed LabelEncoder on the larger dataset with 10 million rows. It runs without crashing, but the fit line alone took 50 seconds, and the df.apply(le.transform) step took about 80 seconds.

How can I:

  • Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the correct answer when the dataframe has two columns?
  • Store the mapping so that I can reuse it on different data (as LabelEncoder allows me to)?
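For what it's worth, here is one sketch (illustrative names, using the sample data from the question) that aims at both requirements: fit a single sorted list of uniques over both columns, then transform each column by categorical lookup against that shared list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})

# "fit": one sorted array of uniques shared by both columns -- the same
# classes a global LabelEncoder.fit(df.values.flat) would produce.
classes = np.sort(pd.unique(df.values.ravel()))

# "transform": per-column hash-table lookup against the shared classes.
codes = df.apply(lambda s: pd.Categorical(s, categories=classes).codes)

# The classes array is the reusable mapping; values unseen at fit time
# encode as -1 rather than raising, unlike LabelEncoder.
new = pd.DataFrame({'ID_0': ['d'], 'ID_1': ['z']})
new_codes = new.apply(lambda s: pd.Categorical(s, categories=classes).codes)
```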
2 answers

It looks like it will be much faster to use the pandas category data type; internally this uses a hash table, whereas LabelEncoder uses a sorted search:

 In [87]: df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 1000000),
                             'ID_1': np.random.randint(0, 1000, 1000000)}).astype(str)

 In [88]: le.fit(df.values.flat)
          %time x = df.apply(le.transform)
 CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
 Wall time: 6.37 s

 In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
 CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
 Wall time: 331 ms

EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want pandas as a dependency):

 import pandas as pd
 from pandas.core.nanops import unique1d
 from sklearn.base import BaseEstimator, TransformerMixin

 class PandasLabelEncoder(BaseEstimator, TransformerMixin):
     def fit(self, y):
         self.classes_ = unique1d(y)
         return self

     def transform(self, y):
         s = pd.Series(y).astype('category', categories=self.classes_)
         return s.cat.codes
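(Note that `pandas.core.nanops.unique1d` and the `categories=` keyword of `Series.astype` are from older pandas and have since been removed. On a current pandas, an equivalent sketch of the same transformer might look like this:)

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        # Sorted uniques, analogous to LabelEncoder's classes_.
        self.classes_ = np.unique(np.asarray(y))
        return self

    def transform(self, y):
        # Categorical lookup uses a hash table; unseen labels become -1.
        return pd.Categorical(y, categories=self.classes_).codes
```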

I tried this with a DataFrame:

 In [xxx]: import string

 In [xxx]: letters = np.array([c for c in string.ascii_lowercase])

 In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000),
                              'ID_1': np.random.choice(letters, 10000000)})

It looks like this:

 In [261]: df.head()
 Out[261]:
   ID_0 ID_1
 0    v    z
 1    i    i
 2    d    n
 3    z    r
 4    x    x

 In [262]: df.shape
 Out[262]: (10000000, 2)

So, 10 million rows. Locally, my timings are:

 In [257]: %timeit le.fit(df.values.flat)
 1 loops, best of 3: 17.2 s per loop

 In [258]: %timeit df2 = df.apply(le.transform)
 1 loops, best of 3: 30.2 s per loop

Then I mapped the letters to numbers and used pandas.Series.map:

 In [248]: letters = np.array([l for l in string.ascii_lowercase])

 In [263]: d = dict(zip(letters, range(26)))

 In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
 1 loops, best of 3: 1.12 s per loop

 In [274]: df.head()
 Out[274]:
    ID_0  ID_1
 0    21    25
 1     8     8
 2     3    13
 3    25    17
 4    23    23

So this may be an option. The dict just needs to contain all the values that occur in the data.
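If the alphabet isn't fixed in advance, the dict can be built from the data itself; a small sketch on the question's sample (the resulting mapping matches what a global LabelEncoder fit would assign):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})

# Collect every value that occurs in either column, sorted,
# then number them 0..n-1.
uniques = np.sort(pd.unique(df.values.ravel()))
d = {v: i for i, v in enumerate(uniques)}

encoded = df.apply(lambda col: col.map(d))
```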

EDIT: The OP asked what timing I get for the category approach. This is what I get:

 In [40]: %timeit x = df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
 1 loops, best of 3: 13.5 s per loop

EDIT: in response to the second comment:

 In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
 1 loops, best of 3: 933 ms per loop

 In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
 1 loops, best of 3: 1.35 s per loop
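Once all columns are categorical with the shared `uniques` list, the integer codes fall out of `.cat.codes` and agree across columns. On a pandas where `astype('category', categories=...)` no longer exists, the same conversion can be written with `CategoricalDtype` (a sketch on the small sample from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})
uniques = np.sort(pd.unique(df.values.ravel()))

# Modern spelling of astype('category', categories=uniques):
dtype = pd.CategoricalDtype(categories=uniques)
dfc = df.apply(lambda x: x.astype(dtype))

# Integer codes, consistent across both columns.
codes = dfc.apply(lambda x: x.cat.codes)
```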
