I am new to R and trying to remove duplicate columns from a large data frame (50K rows, 215 columns). The frame contains a mix of discrete, continuous, and categorical variables.
My approach was to build a table() for each column of the frame, collect those tables in a list, and then use duplicated() to find the duplicated entries in that list, as shown below:
    age <- 18:29
    height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
    gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
    testframe <- data.frame(age = age, height = height, height2 = height,
                            gender = gender, gender2 = gender)
    tables <- apply(testframe, 2, table)
    dups <- which(duplicated(tables))
    testframe <- subset(testframe, select = -c(dups))
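For reference, on this toy frame the intermediate tables object comes back as a plain list, because the per-column table() results have different lengths (12 cells for age and the heights, 2 for the genders); I assume this is why duplicated() can compare the columns element by element here:

    # tables is a plain list: the per-column table() results differ in
    # length, so apply() cannot simplify them to a matrix, and
    # duplicated() then compares whole list elements, one per column
    str(tables)
    length(tables)  # 5, one entry per column of testframe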
This is not very efficient, especially for large continuous variables. However, I went down this route because I was unable to get the same result using summary() (note: the following assumes the original testframe, still containing the duplicates):
    summaries <- apply(testframe, 2, summary)
    dups <- which(duplicated(summaries))
    testframe <- subset(testframe, select = -c(dups))
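In case it helps with the diagnosis, I notice that summaries, unlike tables above, does not come back as a list on the toy frame, which may be related:

    # apply() coerces the mixed-type frame to a character matrix first,
    # so summary() yields the same Length/Class/Mode triple for every
    # column and the results simplify into a 3 x 5 character matrix;
    # duplicated() on a matrix compares its rows, not its columns
    str(summaries)
    class(summaries)  # a matrix here, not a list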
If you run this code, you will see that it only removes the first duplicate found. I presume this is because I am doing something wrong. Can someone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a data frame?
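For what it is worth, one alternative I am considering is to skip the per-column summaries entirely and compare the columns themselves, since duplicated() can compare the elements of a list directly (a minimal sketch, untested on the full 50K x 215 data):

    # sketch: treat the frame as a list of column vectors and keep only
    # those columns whose contents are not an exact duplicate of an
    # earlier column
    testframe <- testframe[!duplicated(as.list(testframe))]

I do not know whether this is idiomatic or how it scales, so pointers are still welcome.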