"ValueError: labels ['timestamp'] are not contained in the axis" error

I have this code and want to remove the column timestamp from the file u.data, but I can't. It raises the error "ValueError: labels ['timestamp'] not contained in the axis". How can I fix it?

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    plt.rc("font", size=14)
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import Ridge
    from sklearn.cross_validation import KFold
    from sklearn.cross_validation import train_test_split

    data = pd.read_table('u.data')
    data.columns=['userID', 'itemID','rating', 'timestamp']
    data.drop('timestamp', axis=1)
    N = len(data)
    print data.shape
    print list(data.columns)
    print data.head(10)
3 answers

One of the biggest problems you have to deal with here is invisible: when you insert headers into the u.data file, the separator between the header names must be exactly the same as the separator between the values on each data line. For example, if a tab separates the values in a row, you must not use spaces in the header.

In your u.data file, add the headers and separate them with the same delimiter that is used between the items on each data line. PS: Use Sublime Text; Notepad / Notepad++ sometimes does not work.
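
To illustrate the separator point, here is a minimal sketch that reads a small in-memory sample instead of the real file. The two data rows and the `bad`/`good` names are made up for illustration, and `pd.read_csv(..., sep="\t")` is used as an explicit stand-in for `pd.read_table`.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for u.data (tab-separated values).
rows = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# Header typed with spaces: the tab-based parser sees one long field,
# so no column is actually called "timestamp".
bad = pd.read_csv(io.StringIO("userID itemID rating timestamp\n" + rows), sep="\t")

# Header typed with tabs, matching the data rows: four separate names.
good = pd.read_csv(io.StringIO("userID\titemID\trating\ttimestamp\n" + rows), sep="\t")

print("timestamp" in bad.columns)
print(list(good.columns))
```

With the space-separated header, any lookup of `timestamp` by name fails, which is the kind of mismatch described above.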


"ValueError: labels ['timestamp'] are not contained in the axis"

You have no headers in the file, so when you loaded it you got a df in which the column names are taken from the first row of data. You tried to access a column, timestamp, that does not exist.

Your u.data has no headers in it:

    $ head u.data
    196     242     3       881250949
    186     302     3       891717742
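
As a rough check of that behaviour, the sketch below loads the two rows shown above from an in-memory string (an assumption made so the example runs without the real file) using the default header handling.

```python
import io
import pandas as pd

# The two rows shown above, tab-separated, with no header line.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# By default the first line is consumed as the header row.
df = pd.read_csv(io.StringIO(sample), sep="\t")

print(list(df.columns))           # names come from the first data row
print(len(df))                    # one data row is left
print("timestamp" in df.columns)
```

The first record ends up as the column labels, so one row of data is lost and there is no `timestamp` label to drop.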

So working with column names will not be possible unless headers are added. You can add headers to the u.data file, for example. I opened it in a text editor and added the line a b c timestamp at the top, with the names separated by tabs (this is apparently a tab-delimited file, so be careful not to use spaces in the header, otherwise it will break the format):

    $ head u.data
    a       b       c       timestamp
    196     242     3       881250949
    186     302     3       891717742

Now your code works and data.columns returns

 Index([u'a', u'b', u'c', u'timestamp'], dtype='object') 

And the rest of your code now prints

    (100000, 4)                     # the shape
    ['a', 'b', 'c', 'timestamp']    # the columns
         a    b  c  timestamp       # the df
    0  196  242  3  881250949
    1  186  302  3  891717742
    2   22  377  1  878887116
    3  244   51  2  880606923
    4  166  346  1  886397596
    5  298  474  4  884182806
    6  115  265  2  881171488
    7  253  465  5  891628467
    8  305  451  3  886324817
    9    6   86  3  883603013

If you do not want to add headers

Alternatively, you can drop the timestamp column by position (it should be at index 3). We can do it with df.iloc as below (the older df.ix indexer was removed in pandas 1.0). It selects all rows and the columns at positions 0 to 2, thereby discarding the column at position 3. Note that the end of the slice, 3, is exclusive.

    data.iloc[:, 0:3]
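
A small sketch of the positional approach, run on an in-memory sample rather than the real file. The sample rows and the `header=None` argument are assumptions added here so that the first record stays in the data.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for u.data (tab-separated, no header).
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# header=None keeps the first row as data; columns get integer labels 0..3.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# Positional slice: the end is exclusive, so 0:3 keeps positions 0, 1 and 2.
first_three = df.iloc[:, 0:3]

# Same result by dropping whatever label sits at position 3.
dropped = df.drop(columns=df.columns[3])

print(first_three.shape)
print(list(dropped.columns))
```

Both calls return a new DataFrame and leave `df` unchanged, so assign the result if you want to keep it.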

I would do it like this:

    data = pd.read_table('u.data', header=None,
                         names=['userID', 'itemID', 'rating', 'timestamp'],
                         usecols=['userID', 'itemID', 'rating'])

Check:

    In [589]: data.head()
    Out[589]:
       userID  itemID  rating
    0     196     242       3
    1     186     302       3
    2      22     377       1
    3     244      51       2
    4     166     346       1
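
If you would rather load all four columns and drop one afterwards, the sketch below shows one way, again on a made-up in-memory sample instead of the real u.data. It keeps the value returned by `drop()`, because `drop()` returns a new DataFrame rather than modifying `data` in place.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for u.data (tab-separated, no header).
sample = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)
names = ["userID", "itemID", "rating", "timestamp"]

# Name the columns at read time instead of editing the file.
data = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=names)

# drop() returns a new DataFrame, so keep the result.
without_ts = data.drop(columns="timestamp")

# Same outcome as the usecols approach: the column is never loaded.
trimmed = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                      names=names, usecols=names[:3])

print(list(without_ts.columns))
print(without_ts.equals(trimmed))
```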
