Identify duplicate columns in a data frame

I am new to R and trying to remove duplicate columns from a large data frame (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.

My approach was to generate a table for each column of the frame, collect the tables in a list, and then use the duplicated() function to find duplicate entries in that list, as shown below:

    age=18:29
    height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
    gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
    testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
    tables=apply(testframe,2,table)
    dups=which(duplicated(tables))
    testframe <- subset(testframe, select = -c(dups))

This is not very efficient, especially for large continuous variables. However, I went down this route because I could not get the same result using summary() (note: the following assumes the original testframe containing duplicates):

    summaries=apply(testframe,2,summary)
    dups=which(duplicated(summaries))
    testframe <- subset(testframe, select = -c(dups))

If you run this code, you will see that it only removes the first duplicate found. I suppose this is because I am doing something wrong. Can someone point out where I'm going wrong or, even better, point me toward a better way to remove duplicate columns from a data frame?

9 answers

You can do this with lapply:

 testframe[!duplicated(lapply(testframe, summary))] 

summary summarizes the distribution while ignoring the order.
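To make the difference from the question's code concrete, here is a minimal sketch using the question's own example data. My understanding is that apply() first coerces the data frame to a matrix and hands back a matrix of summaries, and duplicated() on a matrix compares rows rather than columns, whereas lapply() visits each column as it is. The second part uses a small hypothetical frame (`shuffled`) to show what "ignoring the order" means in practice.

```r
# Rebuild the example frame from the question.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# lapply() returns a list with one summary per column;
# duplicated() then compares those list elements with each other.
deduped <- testframe[!duplicated(lapply(testframe, summary))]
print(names(deduped))

# Caveat: summary() ignores row order, so a reordered copy of a column
# is flagged as a duplicate even though the columns are not identical.
shuffled <- data.frame(age = age, reversed = rev(age))  # hypothetical example
flags <- duplicated(lapply(shuffled, summary))
print(flags)
```

If columns must match value for value, comparing the columns themselves rather than their summaries is the safer choice.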

I'm not 100% sure, but I would use digest if the data is huge:

    library(digest)
    testframe[!duplicated(lapply(testframe, digest))]

What about:

 testframe[!duplicated(as.list(testframe))] 
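As a small sketch of how this behaves on the question's example data: duplicated() on the list of columns marks every column whose values repeat an earlier column, so the same logical vector can be used both to list the dropped columns and to keep the rest. The variable names `is_dup`, `dropped` and `kept` are only for illustration.

```r
# Rebuild the example frame from the question.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# TRUE for each column whose contents repeat an earlier column.
is_dup <- duplicated(as.list(testframe))

dropped <- names(testframe)[is_dup]   # columns that would be removed
kept <- testframe[!is_dup]            # first occurrence of each column survives

print(dropped)
print(names(kept))
```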

A nice trick you can use is transposing your data frame and then checking for duplicates.

 duplicated(t(testframe)) 
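The call above only returns a logical vector, so here is a hedged sketch of using it to actually drop the columns, again on the question's example data. One assumption worth flagging: t() turns a mixed-type data frame into a character matrix, so numeric columns are compared as text, and I would not rely on this for values that differ only in their last few digits.

```r
# Rebuild the example frame from the question.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# After transposing, each original column is a row of the matrix,
# and duplicated() on a matrix compares rows.
col_is_dup <- duplicated(t(testframe))

# Keep only the columns that are not flagged.
deduped <- testframe[!col_is_dup]
print(names(deduped))
```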

You might expect

    unique(testframe, MARGIN=2)

to work, but it does not, although I think it should, so try

 as.data.frame(unique(as.matrix(testframe), MARGIN=2)) 

or if you are worried about numbers turning into factors,

 testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] 

which produces

       age height gender
    1   18   76.1      M
    2   19   77.0      F
    3   20   78.1      M
    4   21   78.2      M
    5   22   78.8      F
    6   23   79.7      F
    7   24   79.9      M
    8   25   81.1      M
    9   26   81.2      F
    10  27   81.8      M
    11  28   82.8      F
    12  29   83.5      M

Here is a simple command that will work if the duplicated columns of your data frame have the same name:

 testframe[names(testframe)[!duplicated(names(testframe))]] 
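A short sketch of what this does, using a hypothetical frame `df` built with check.names = FALSE so that two columns can share a name. Note that only the names are compared here, not the contents, so a same-named column with different values is dropped as well.

```r
# Hypothetical frame with a repeated column name; check.names = FALSE
# stops data.frame() from renaming the second "x" to "x.1".
df <- data.frame(id = 1:3, x = c(1, 2, 3), x = c(9, 8, 7),
                 check.names = FALSE)
print(names(df))

# Keep the first column for each distinct name.
kept <- df[!duplicated(names(df))]
print(kept)
```

Logical indexing is used here instead of indexing by name; with repeated names, both forms select the first matching column.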

It is probably best to first find the duplicated column names and handle them accordingly (e.g. add the two, take the average, the first, the last, the second, the mode, etc.). To find the duplicated column names:

 names(df)[duplicated(names(df))] 

If the problem is that the data frames have been merged too many times, for example:

  testframe2 <- merge(testframe, testframe, by = c('age')) 

It is also good to remove the suffix .x from the column names. I applied it here, on top of Mostafa Rezai's excellent answer:

    # drop columns whose contents repeat an earlier column
    testframe2 <- testframe2[!duplicated(as.list(testframe2))]
    # strip a trailing ".x"; the dot is escaped so it matches a literal dot only
    names(testframe2) <- sub('\\.x$','',names(testframe2))
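One thing to watch when stripping the suffix: in a regular expression an unescaped dot matches any character, so the pattern '.x' removes every "any character followed by x" pair, not just the merge suffix. A minimal sketch with hypothetical column names (`merged_names` is made up for illustration):

```r
# Hypothetical column names as they might look after a merge.
merged_names <- c("age", "height.x", "max.x")

# Unescaped dot: "ax" inside "max.x" is removed too.
loose <- gsub(".x", "", merged_names)
print(loose)

# Escaped dot anchored at the end: only a trailing ".x" is removed.
strict <- sub("\\.x$", "", merged_names)
print(strict)
```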

How about just:

 unique.matrix(testframe, MARGIN=2) 

In fact, you just need to invert the duplicated() result in your code and stick with subset() (which, IMHO, is more readable than bracket indexing):

    require(dplyr)
    iris %>% subset(., select=which(!duplicated(names(.))))
