Reindexing after pandas.drop_duplicates

I want to open a file, read it, remove rows that are duplicated in two columns of the file, and then use the deduplicated data to perform some calculations. To do this, I use pandas.drop_duplicates, which removes the duplicate rows but keeps the original index values of the remaining rows. For example, after row 1 is dropped, file1 becomes file2:

 file1:
    Var1  Var2  Var3  Var4
 0    52     2     3    89
 1    65     2     3    43
 2    15     1     3    78
 3    33     2     4    67

 file2:
    Var1  Var2  Var3  Var4
 0    52     2     3    89
 2    15     1     3    78
 3    33     2     4    67

For further use of file2 as a data frame, I need to reindex it to 0, 1, 2, ...

Here is the code I'm using:

 file1 = pd.read_csv("filename.txt", sep='|', header=None,
                     names=['Var1', 'Var2', 'Var3', 'Var4'])
 file2 = file1.drop_duplicates(["Var2", "Var3"])
 # create another variable as a new index: ni
 file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
 file2 = file2.set_index('ni')

Although the code works and gives good results, reindexing gives the following warning:

 SettingWithCopyWarning:
 A value is trying to be set on a copy of a slice from a DataFrame.
 Try using .loc[row_indexer,col_indexer] = value instead
 See the caveats in the documentation:
 http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
   file2['ni'] = range(0, len(file2))

I checked the link, but cannot figure out how to change the code. Any ideas on how to fix this?
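For context, here is a minimal, self-contained sketch of the situation (the column names come from the question above; the data values are made up for illustration). The warning arises because pandas cannot always prove that the frame returned by drop_duplicates is not a view of the original; taking an explicit .copy() before assigning the new column is one way to make the assignment unambiguous and silence the warning:

```python
import pandas as pd

# Toy data mirroring the file1 example from the question.
file1 = pd.DataFrame({
    'Var1': [52, 65, 15, 33],
    'Var2': [2, 2, 1, 2],
    'Var3': [3, 3, 3, 4],
    'Var4': [89, 43, 78, 67],
})

# Explicit .copy() so the later column assignment clearly targets a new
# frame rather than a possible view of file1.
file2 = file1.drop_duplicates(["Var2", "Var3"]).copy()
file2['ni'] = range(len(file2))
file2 = file2.set_index('ni')
print(file2.index.tolist())  # [0, 1, 2]
```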

1 answer

Pandas has a built-in method for exactly this task, which lets you avoid the warning with a simpler approach.

Instead of adding a new column of consecutive numbers and then setting the index to that column, as you did:

 file2['ni'] = range(0, len(file2))  # this is the line that generates the warning
 file2 = file2.set_index('ni')

Instead, you can use:

 file2 = file2.reset_index(drop=True) 

The default behavior of .reset_index() is to take the current index, insert it as the first column of the DataFrame, and then build a new index (presumably because this makes it easy to compare the old and new indexes, which is useful for verification). Passing drop=True means that instead of saving the old index as a new column, it is simply discarded and replaced by the fresh 0, 1, 2, ... index you want.
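The two behaviors can be seen side by side on toy data (not from the question):

```python
import pandas as pd

# A frame whose index has a gap, like file2 after drop_duplicates.
df = pd.DataFrame({'a': [10, 20, 30]}, index=[0, 2, 3])

kept = df.reset_index()              # old index preserved as a column 'index'
dropped = df.reset_index(drop=True)  # old index discarded entirely

print(kept.columns.tolist())   # ['index', 'a']
print(dropped.index.tolist())  # [0, 1, 2]
```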

All together, your new code might look like this:

 file1 = pd.read_csv("filename.txt", sep='|', header=None,
                     names=['Var1', 'Var2', 'Var3', 'Var4'])
 file2 = file1.drop_duplicates(["Var2", "Var3"]).reset_index(drop=True)
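As a side note, if you are on pandas 1.0 or later, drop_duplicates accepts an ignore_index=True argument that resets the index in the same call, collapsing the two steps into one (toy data below for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Var2': [2, 2, 1, 2], 'Var3': [3, 3, 3, 4]})

# ignore_index=True (pandas >= 1.0) relabels the result 0, 1, ..., n-1
# directly, so no separate reset_index call is needed.
out = df.drop_duplicates(['Var2', 'Var3'], ignore_index=True)
print(out.index.tolist())  # [0, 1, 2]
```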

See also this question.


Source: https://habr.com/ru/post/1214761/
