Identify duplicate columns in a data frame

Question

I am new to R and trying to remove duplicate columns from a large data frame (50K rows, 215 columns). The data frame has a mix of discrete, continuous, and categorical variables.

My approach was to generate a table for each column of the data frame, collect the tables in a list, and then use the duplicated() function to find duplicated entries in that list, as shown below:

 age=18:29
 height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
 gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
 testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
 tables=apply(testframe,2,table)
 dups=which(duplicated(tables))
 testframe <- subset(testframe, select = -c(dups))

This is not very efficient, especially for large continuous variables. However, I went down this route because I could not get the same result using summary() (note: the following assumes the original testframe containing duplicates):

 summaries=apply(testframe,2,summary)
 dups=which(duplicated(summaries))
 testframe <- subset(testframe, select = -c(dups))

If you run this code, you will see that it only removes the first duplicate found. I suppose this is because I am doing something wrong. Can someone point out my mistake or, even better, point me toward a better way to remove duplicate columns from a data frame?

+16

r dataframe

Benhealey Mar 22 '12 at 6:31

9 answers

What about:

 testframe[!duplicated(as.list(testframe))]

+18

Mostafa rezaei Nov 05 '15 at 19:04

A nice trick you can use is to transpose your data frame and then check for duplicates.

 duplicated(t(testframe))
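The call above only returns a logical vector, so it still has to be used to drop the flagged columns. A minimal sketch, assuming testframe is built as in the question (dup_cols and deduped are names chosen here for illustration):

```r
# Rebuild the example data from the question
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# t() coerces the data frame to a matrix with one row per original column,
# so duplicated() flags every column that repeats an earlier one
dup_cols <- duplicated(t(testframe))

# Keep only the columns that were not flagged
deduped <- testframe[, !dup_cols, drop = FALSE]
names(deduped)
```

Note that with mixed column types the transpose is a character matrix, so numeric columns are compared as formatted text. Columns that differ only beyond the formatted precision may therefore be treated as equal.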

+4

hshihab Mar 09 '16 at 9:33

 unique(testframe, MARGIN=2)

does not work, although I think it should, so try

 as.data.frame(unique(as.matrix(testframe), MARGIN=2))

or, if you are worried about numbers turning into factors (or character strings),

 testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]

which produces

    age height gender
 1   18   76.1      M
 2   19   77.0      F
 3   20   78.1      M
 4   21   78.2      M
 5   22   78.8      F
 6   23   79.7      F
 7   24   79.9      M
 8   25   81.1      M
 9   26   81.2      F
 10  27   81.8      M
 11  28   82.8      F
 12  29   83.5      M

+3

Henry Mar 22 '12 at 8:11

Here is a simple command that will work if the duplicated columns of your data frame have the same name:

 testframe[names(testframe)[!duplicated(names(testframe))]]

0

Fabio natalini Mar 09 '18 at 11:46

It is probably best to first find the duplicated column names and handle them accordingly (e.g. sum the two, take the mean, keep the first, last, or second, take the mode, etc.). To find the duplicated column names:

 names(df)[duplicated(names(df))]

0

Matt elgazar Aug 27 '19 at 20:22

If the problem is that the data frames have been merged too many times, for example:

  testframe2 <- merge(testframe, testframe, by = c('age'))

it is also good to remove the .x suffix from the column names. I applied this here on top of Mostafa Rezaei's excellent answer:

  testframe2 <- testframe2[!duplicated(as.list(testframe2))]
  # escape the dot and anchor at the end so only a literal ".x" suffix is removed
  names(testframe2) <- gsub('\\.x$','',names(testframe2))

0

M boulanger Sep 26 '19 at 16:36

How about just:

 unique.matrix(testframe, MARGIN=2)

0

Vitali avagyan Oct 20 '19 at 17:47

In fact, you just need to negate the result of duplicated() in your code and stick with subset() (which, IMHO, is more readable than bracket indexing):

 require(dplyr)
 iris %>% subset(., select=which(!duplicated(names(.))))

-1

Holger brandl Jan 4 '17 at 9:33

kohske · Accepted Answer · 2012-03-22T07:58:02+0000

You can do it with lapply:

 testframe[!duplicated(lapply(testframe, summary))]

summary summarizes the distribution while ignoring the order.
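A likely reason the apply() version in the question misbehaves, sketched under the assumption that testframe is built as in the question: apply() first coerces the data frame to a character matrix, so summary() sees only character vectors, and the per-column results are simplified into a matrix whose rows (not columns) duplicated() then compares. lapply() keeps each column's own type and returns a list with one summary per column. The names via_apply, via_lapply and deduped are chosen here for illustration.

```r
# Rebuild the example data from the question
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# apply() works on as.matrix(testframe), a character matrix here, and
# summary() of a character vector reports only Length, Class and Mode.
# The results are simplified into a matrix, not a per-column list.
via_apply <- apply(testframe, 2, summary)
is.matrix(via_apply)

# lapply() summarises each column with its original type and returns a
# list, so duplicated() compares one summary per column
via_lapply <- lapply(testframe, summary)
deduped <- testframe[!duplicated(via_lapply)]
names(deduped)
```

Keep in mind that a summary is a lossy fingerprint: two different columns with the same summary statistics (for instance, a reordered copy of a column) would also be flagged as duplicates.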

Not 100%, but I would use the digest package if the data is huge:

 library(digest)
 testframe[!duplicated(lapply(testframe, digest))]
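If exactness matters more than speed, flagged columns can be confirmed with identical(), which compares the actual values rather than a summary or hash. A minimal base R sketch; find_exact_duplicates is a hypothetical helper name used here for illustration, not part of digest or any other package:

```r
# Hypothetical helper: flag each column that is exactly equal
# (same type, same values, same order) to an earlier, unflagged column
find_exact_duplicates <- function(df) {
  dup <- logical(ncol(df))
  for (j in seq_along(df)[-1]) {
    for (i in seq_len(j - 1)) {
      if (!dup[i] && identical(df[[i]], df[[j]])) {
        dup[j] <- TRUE
        break
      }
    }
  }
  stats::setNames(dup, names(df))
}

# Example data from the question
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

dups <- find_exact_duplicates(testframe)
deduped <- testframe[!dups]
names(deduped)
```

This does a pairwise comparison, so the number of identical() calls grows quadratically with the number of columns. That should be manageable for a few hundred columns, but it is a sketch rather than a tuned implementation.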
