Merge data frames and merge columns into one

I have the following three data frames:

    df1 <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
                      age=c(31, 26, 54, 48),
                      height=c(180, 175, 160, 168),
                      group=c("Student",3,5,"Employer"),
                      stringsAsFactors=FALSE)
    df2 <- data.frame(name=c("Anne", "Christine"),
                      age=c(26, 54),
                      height=c(175, 160),
                      group=c(3,5),
                      group2=c("Teacher",6),
                      stringsAsFactors=FALSE)
    df3 <- data.frame(name=c("Christine"),
                      age=c(54),
                      height=c(160),
                      group=c(5),
                      group2=c(6),
                      group3=c("Scientist"),
                      stringsAsFactors=FALSE)

I would like to combine them to get the following result:

    df.all <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
                         age=c(31, 26, 54, 48),
                         height=c(180, 175, 160, 168),
                         group=c("Student", "Teacher", "Scientist", "Employer"))

At the moment, I am doing it like this:

    df.all <- merge(merge(df1[,c(1,4)], df2[,c(1,5)], all=TRUE, by="name"),
                    df3[,c(1,6)], all=TRUE, by="name")
    row.ind <- which(df.all$group %in% c(6,5))
    df.all[row.ind, c("group")] <- df.all[row.ind, c("group2")]
    row.ind2 <- which(df.all$group2 %in% c(6))
    df.all[row.ind2, c("group")] <- df.all[row.ind2, c("group3")]

This does not generalize, and it is really messy. Maybe there is a way to use merge_all or merge_recurse for the merge step (especially since more than two data frames can be merged), but I did not understand how to do this. Neither of these two attempts gives the correct result:

    df.all <- merge_all(list(df1, df2, df3))
    df.all <- merge_recurse(list(df1, df2, df3), by=c("name"))

Is there a more general and elegant way to solve this problem?

2 answers

Here is another possible approach, if I understand what you are after. (It is not clear what the numerical values in the "group" columns are, so I'm not sure if this is exactly what you are looking for.)

Use Reduce() to merge multiple data.frames.

    temp <- Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
    names(temp)[4] <- "group1" # Rename "group" to "group1" for reshaping
    temp
    #        name age height   group1  group2    group3
    # 1      Andy  48    168 Employer    <NA>      <NA>
    # 2      Anne  26    175        3 Teacher      <NA>
    # 3 Christine  54    160        5       6 Scientist
    # 4      John  31    180  Student    <NA>      <NA>
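To make the Reduce() step less opaque: it folds the list from left to right, so it should be equivalent to nesting the merge() calls by hand. The following is a minimal sketch on made-up toy frames (a, b and c3 are illustrative names, not part of the question's data):

```r
# Toy frames that share only the "id" column (illustrative data)
a  <- data.frame(id = 1:3, x = c("a", "b", "c"), stringsAsFactors = FALSE)
b  <- data.frame(id = 2:4, y = c(10, 20, 30))
c3 <- data.frame(id = c(1, 4), z = c(TRUE, FALSE))

# Reduce() applies the function to the first two elements,
# then to that result and the third element, and so on
folded <- Reduce(function(x, y) merge(x, y, all = TRUE), list(a, b, c3))

# The same thing written out as nested calls
nested <- merge(merge(a, b, all = TRUE), c3, all = TRUE)

identical(folded, nested)
```

Because no by argument is given, merge() joins on every column name the two frames have in common, which is why the shared "group" columns in the question's data end up as join keys as well.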

Use reshape() to change your data from wide to long.

    df.all <- reshape(temp, direction = "long", idvar="name", varying=4:6, sep="")
    df.all
    #                  name age height time     group
    # Andy.1           Andy  48    168    1  Employer
    # Anne.1           Anne  26    175    1         3
    # Christine.1 Christine  54    160    1         5
    # John.1           John  31    180    1   Student
    # Andy.2           Andy  48    168    2      <NA>
    # Anne.2           Anne  26    175    2   Teacher
    # Christine.2 Christine  54    160    2         6
    # John.2           John  31    180    2      <NA>
    # Andy.3           Andy  48    168    3      <NA>
    # Anne.3           Anne  26    175    3      <NA>
    # Christine.3 Christine  54    160    3 Scientist
    # John.3           John  31    180    3      <NA>

Take advantage of the fact that as.numeric() coerces non-numeric strings to NA (with a warning): keep only the rows where the conversion yields NA, then use na.omit() to drop the rows whose group was already NA.

    na.omit(df.all[is.na(as.numeric(df.all$group)), ])
    #                  name age height time     group
    # Andy.1           Andy  48    168    1  Employer
    # John.1           John  31    180    1   Student
    # Anne.2           Anne  26    175    2   Teacher
    # Christine.3 Christine  54    160    3 Scientist
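The filter above relies on two different sources of NA, which can be easier to see on a bare vector. This is a small sketch with an illustrative vector (groups is a made-up name); suppressWarnings() is only there to silence the coercion warning:

```r
# Mix of text labels, numeric codes stored as strings, and a genuine NA
groups <- c("Employer", "3", NA, "Teacher", "6")

# Non-numeric strings become NA; numeric strings convert cleanly
as.num <- suppressWarnings(as.numeric(groups))

# TRUE for text labels AND for values that were already NA
is.label.or.na <- is.na(as.num)

# Dropping the genuine NAs leaves only the text labels
labels <- groups[is.label.or.na & !is.na(groups)]
labels
```

In the data frame version, na.omit() plays the role of the `!is.na(groups)` condition, with the caveat that it drops a row if any column is NA, not just the group column.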

Again, this may oversimplify your problem (for example, there may be NA values in other columns), but it may help you solve it.


The first step is to use merge_recurse with all.x = TRUE:

    library(reshape)
    merge.all <- merge_recurse(list(df1, df2, df3), all.x = TRUE)
    #        name age height    group  group2    group3
    # 1      Anne  26    175        3 Teacher      <NA>
    # 2 Christine  54    160        5       6 Scientist
    # 3      John  31    180  Student    <NA>      <NA>
    # 4      Andy  48    168 Employer    <NA>      <NA>

Then you can use apply to take the last non-NA value across the group columns in each row:

    group.cols <- grep("group", colnames(merge.all))
    merge.all <- data.frame(merge.all[-group.cols],
                            group = apply(merge.all[group.cols], 1,
                                          function(x) tail(na.omit(x), 1)))
    #        name age height     group
    # 1      Anne  26    175   Teacher
    # 2 Christine  54    160 Scientist
    # 3      John  31    180   Student
    # 4      Andy  48    168  Employer
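
If you would rather not depend on the reshape package, the two steps could probably be wrapped into one base R function. The sketch below is an assumption-laden variant, not a drop-in replacement: combine_groups and last_non_na are hypothetical helper names, it uses Reduce() with merge() instead of merge_recurse, and it assumes that the group columns share a common name prefix and are all character after merging.

```r
df1 <- data.frame(name = c("John", "Anne", "Christine", "Andy"),
                  age = c(31, 26, 54, 48),
                  height = c(180, 175, 160, 168),
                  group = c("Student", 3, 5, "Employer"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Anne", "Christine"),
                  age = c(26, 54),
                  height = c(175, 160),
                  group = c(3, 5),
                  group2 = c("Teacher", 6),
                  stringsAsFactors = FALSE)
df3 <- data.frame(name = "Christine", age = 54, height = 160,
                  group = 5, group2 = 6, group3 = "Scientist",
                  stringsAsFactors = FALSE)

# Hypothetical helper: last non-NA element of a vector, or NA if there is none
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x)) unname(x[length(x)]) else NA_character_
}

# Hypothetical helper: full outer join of all frames, then collapse every
# column whose name starts with `prefix` into a single "group" column
combine_groups <- function(frames, prefix = "group") {
  merged <- Reduce(function(x, y) merge(x, y, all = TRUE), frames)
  group.cols <- grep(paste0("^", prefix), colnames(merged))
  data.frame(merged[-group.cols],
             group = apply(merged[group.cols], 1, last_non_na),
             stringsAsFactors = FALSE)
}

result <- combine_groups(list(df1, df2, df3))
result
```

Unlike tail(na.omit(x), 1), the last_non_na helper returns NA for a row whose group columns are all NA, so apply() keeps returning a plain vector in that case.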