I have a big CSV with two strings per row, in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and encode the strings as integers as follows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("test.csv", usecols=[0, 1], prefix="ID_", header=None)
le = LabelEncoder()
le.fit(df.values.flat)
df = df.apply(le.transform)  # convert the strings to integer codes
This code is from https://stackoverflow.com.
The code works very well, but it is slow when df is big. I timed each step, and the result surprised me:
- pd.read_csv takes about 40 seconds.
- le.fit(df.values.flat) takes about 30 seconds.
- df = df.apply(le.transform) takes about 250 seconds.
Is there a way to speed up this last step? It feels like it should be the fastest of the three!
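For what it's worth, since the encoder is fit on the flattened values, a single transform over the flattened array, reshaped afterwards, should give the same result as the per-column apply. A minimal sketch, assuming df still holds the raw strings and le is the fitted encoder from above:

import numpy as np

# One transform call over all values instead of one call per column;
# the integer codes are then reshaped back to the two-column layout.
codes = le.transform(df.values.ravel())
encoded = pd.DataFrame(codes.reshape(df.shape),
                       columns=df.columns, index=df.index)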
Additional timings for the encoding stage, on a computer with 4 GB of RAM:
The answer below from maxymoo is fast, but it does not give the correct result. Taking the example CSV from the top of the question, it translates it into:
   0  1
0  4  6
1  0  4
2  2  5
3  6  3
4  3  5
5  5  4
6  1  1
7  3  2
8  5  0
9  3  4
Note that 'd' maps to 3 in the first column, but 2 in the second.
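For reference, fitting a single encoder on the flattened values, as in the code at the top, keeps one shared alphabet for both columns. A minimal sketch reproducing the sample:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

sample = pd.DataFrame({"ID_0": list("gacjdibdid"),
                       "ID_1": list("khieihbdah")})
le = LabelEncoder().fit(sample.values.flat)  # one alphabet for both columns
print(sample.apply(le.transform))
print(le.transform(["d"]))  # -> [3] in either column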
I tried the solution from https://stackoverflow.com/a/3/920947/ and got the following:
import numpy as np

df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 1000000),
                   'ID_1': np.random.randint(0, 1000, 1000000)}).astype(str)
df.info()
# memory usage: 7.6MB

%timeit x = (df.stack().astype('category')
               .cat.rename_categories(np.arange(len(df.stack().unique())))
               .unstack())
# 1 loops, best of 3: 1.7 s per loop
Then I increased the data size by a factor of 10.
df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 10000000),
                   'ID_1': np.random.randint(0, 1000, 10000000)}).astype(str)
df.info()
# memory usage: 76.3+ MB

%timeit x = (df.stack().astype('category')
               .cat.rename_categories(np.arange(len(df.stack().unique())))
               .unstack())
# MemoryError Traceback (most recent call last)
This method uses so much RAM translating this relatively small dataframe that it crashes. I suspect that is because the timed expression calls df.stack() twice and materializes several full intermediate copies.
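If the categorical route is otherwise attractive, a variant that stacks only once and uses the categorical codes directly might keep the peak memory down; a minimal sketch, assuming the same df as above, which I have not profiled against the 4 GB limit:

stacked = df.stack()                       # stack once and reuse it
cats = pd.Categorical(stacked)             # codes are already 0..n-1
x = pd.Series(cats.codes, index=stacked.index).unstack()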
I also timed LabelEncoder on the larger dataset with 10 million rows. It ran without crashing, but the fit line alone took 50 seconds, and the df.apply(le.transform) step took about 80 seconds.
How can I:
- Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the correct answer when the dataframe has two columns?
- Keep the mapping so that I can reuse it on different data (the way LabelEncoder can)? A sketch of that reuse follows below.
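On the second point, LabelEncoder keeps its mapping in the fitted le.classes_ attribute, so the fitted object can be saved and reapplied later; a minimal sketch, where new_df stands for hypothetical new string data:

import pickle

with open("encoder.pkl", "wb") as f:
    pickle.dump(le, f)                     # persists le.classes_ with the object

with open("encoder.pkl", "rb") as f:
    le2 = pickle.load(f)

encoded = new_df.apply(le2.transform)      # new_df: hypothetical new data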