Reindexing after pandas.drop_duplicates

I want to open a file, read it, remove rows that are duplicated in two columns of the file, and then use the deduplicated data to perform some calculations. To do this, I use pandas.drop_duplicates, which removes the duplicate rows but keeps the original index values of the remaining rows. For example, after row 1 is dropped, file1 becomes file2:

 file1:
    Var1  Var2  Var3  Var4
 0    52     2     3    89
 1    65     2     3    43
 2    15     1     3    78
 3    33     2     4    67

 file2:
    Var1  Var2  Var3  Var4
 0    52     2     3    89
 2    15     1     3    78
 3    33     2     4    67

For further use of file2 as a data frame, I need to reindex it to 0, 1, 2, ...

Here is the code I'm using:

 file1 = pd.read_csv("filename.txt", sep='|', header=None,
                     names=['Var1', 'Var2', 'Var3', 'Var4'])
 file2 = file1.drop_duplicates(["Var2", "Var3"])
 # create another variable as a new index: ni
 file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
 file2 = file2.set_index('ni')

Although the code works and gives good results, reindexing gives the following warning:

 SettingWithCopyWarning:
 A value is trying to be set on a copy of a slice from a DataFrame.
 Try using .loc[row_indexer,col_indexer] = value instead
 See the caveats in the documentation:
 http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
   file2['ni'] = range(0, len(file2))

I checked the link, but cannot figure out how to change the code. Any ideas on how to fix this?
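For context, here is a minimal, self-contained sketch of the situation (the column names come from the question above; the data values are made up for illustration). The warning arises because pandas cannot always prove that the frame returned by drop_duplicates is not a view of the original; taking an explicit .copy() before assigning the new column is one way to make the assignment unambiguous and silence the warning:

```python
import pandas as pd

# Toy data mirroring the file1 example from the question.
file1 = pd.DataFrame({
    'Var1': [52, 65, 15, 33],
    'Var2': [2, 2, 1, 2],
    'Var3': [3, 3, 3, 4],
    'Var4': [89, 43, 78, 67],
})

# Explicit .copy() so the later column assignment clearly targets a new
# frame rather than a possible view of file1.
file2 = file1.drop_duplicates(["Var2", "Var3"]).copy()
file2['ni'] = range(len(file2))
file2 = file2.set_index('ni')
print(file2.index.tolist())  # [0, 1, 2]
```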

1 answer

Pandas has a built-in method for exactly this task, which lets you avoid the warning with a simpler approach.

Instead of adding a new column of consecutive numbers and then setting the index to that column, as you did:

 file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
 file2 = file2.set_index('ni')

Instead, you can use:

 file2 = file2.reset_index(drop=True) 

The default behavior of .reset_index() is to take the current index, insert it as the first column of the DataFrame, and then build a new index (presumably because this makes it easy to compare the old and new indexes, which is useful for verification). Passing drop=True means that instead of saving the old index as a new column, it is simply discarded and replaced by the fresh 0, 1, 2, ... index you want.
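The two behaviors can be seen side by side on toy data (not from the question):

```python
import pandas as pd

# A frame whose index has a gap, like file2 after drop_duplicates.
df = pd.DataFrame({'a': [10, 20, 30]}, index=[0, 2, 3])

kept = df.reset_index()              # old index preserved as a column 'index'
dropped = df.reset_index(drop=True)  # old index discarded entirely

print(kept.columns.tolist())   # ['index', 'a']
print(dropped.index.tolist())  # [0, 1, 2]
```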

All together, your new code might look like this:

 file1 = pd.read_csv("filename.txt", sep='|', header=None,
                     names=['Var1', 'Var2', 'Var3', 'Var4'])
 file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
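As a side note, if you are on pandas 1.0 or later, drop_duplicates accepts an ignore_index=True argument that resets the index in the same call, collapsing the two steps into one (toy data below for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Var2': [2, 2, 1, 2], 'Var3': [3, 3, 3, 4]})

# ignore_index=True (pandas >= 1.0) relabels the result 0, 1, ..., n-1
# directly, so no separate reset_index call is needed.
out = df.drop_duplicates(['Var2', 'Var3'], ignore_index=True)
print(out.index.tolist())  # [0, 1, 2]
```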

See also this question.


Source: https://habr.com/ru/post/1214761/
