Merge data frames and merge columns into one

Question

I have the following three frames:

    df1 <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
                      age=c(31, 26, 54, 48),
                      height=c(180, 175, 160, 168),
                      group=c("Student",3,5,"Employer"),
                      stringsAsFactors=FALSE)
    df2 <- data.frame(name=c("Anne", "Christine"),
                      age=c(26, 54),
                      height=c(175, 160),
                      group=c(3,5),
                      group2=c("Teacher",6),
                      stringsAsFactors=FALSE)
    df3 <- data.frame(name=c("Christine"),
                      age=c(54),
                      height=c(160),
                      group=c(5),
                      group2=c(6),
                      group3=c("Scientist"),
                      stringsAsFactors=FALSE)

I would like to combine them to get the following result:

    df.all <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
                         age=c(31, 26, 54, 48),
                         height=c(180, 175, 160, 168),
                         group=c("Student", "Teacher", "Scientist", "Employer"))

At the moment, I am doing it like this:

    df.all <- merge(merge(df1[,c(1,4)], df2[,c(1,5)], all=TRUE, by="name"),
                    df3[,c(1,6)], all=TRUE, by="name")
    # Rows whose group is a numeric code (3 or 5) take their value from group2
    row.ind <- which(df.all$group %in% c(3,5))
    df.all[row.ind, c("group")] <- df.all[row.ind, c("group2")]
    # Rows whose group2 is a numeric code (6) take their value from group3
    row.ind2 <- which(df.all$group2 %in% c(6))
    df.all[row.ind2, c("group")] <- df.all[row.ind2, c("group3")]

This does not generalize, and it is really messy. Maybe merge_all or merge_recurse could handle the merge step (especially since they can merge more than two data frames), but I did not understand how to use them. Neither of these two attempts gives the correct result:

    df.all <- merge_all(list(df1, df2, df3))
    df.all <- merge_recurse(list(df1, df2, df3), by=c("name"))

Is there a more general and elegant way to solve this problem?

+4

r

AnjaM Dec 14 '12 at 14:56


2 answers

The first step is to use merge_recurse (from the reshape package) with all.x = TRUE:

    library(reshape)
    merge.all <- merge_recurse(list(df1, df2, df3), all.x = TRUE)
    #        name age height    group  group2    group3
    # 1      Anne  26    175        3 Teacher      <NA>
    # 2 Christine  54    160        5       6 Scientist
    # 3      John  31    180  Student    <NA>      <NA>
    # 4      Andy  48    168 Employer    <NA>      <NA>

Then you can use apply to pick, for each row, the last non-NA value across all the group columns:

    group.cols <- grep("group", colnames(merge.all))
    merge.all <- data.frame(merge.all[-group.cols],
                            group = apply(merge.all[group.cols], 1,
                                          function(x) tail(na.omit(x), 1)))
    #        name age height     group
    # 1      Anne  26    175   Teacher
    # 2 Christine  54    160 Scientist
    # 3      John  31    180   Student
    # 4      Andy  48    168  Employer
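To see what the anonymous function inside apply() is doing, here is a minimal sketch on a small character matrix. The object names `groups` and `last_non_na` are made up for illustration; the values mirror three rows of the merged group columns.

```r
# Toy stand-in for the group columns of the merged data (illustrative only).
groups <- rbind(
  Anne      = c("3", "Teacher", NA),
  Christine = c("5", "6", "Scientist"),
  John      = c("Student", NA, NA)
)

# For each row: drop the NAs, then keep the last value that remains.
last_non_na <- apply(groups, 1, function(x) tail(na.omit(x), 1))
print(last_non_na)
```

This relies on the most specific label always sitting in the right-most filled column. If a row were NA in every group column, tail() would return a zero-length vector for it and apply() would hand back a list instead of a character vector, so such rows would need separate handling.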

+4

flodel Dec 14 '12 at 15:18


A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2012-12-14T18:26:28+0000

Here is another possible approach, if I understand what you are ultimately after. (It is not clear what the numerical values in the “group” columns are, so I'm not sure this is exactly what you are looking for.)

Use Reduce() to combine multiple data.frames.

    temp <- Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
    names(temp)[4] <- "group1" # Rename "group" to "group1" for reshaping
    temp
    #        name age height   group1  group2    group3
    # 1      Andy  48    168 Employer    <NA>      <NA>
    # 2      Anne  26    175        3 Teacher      <NA>
    # 3 Christine  54    160        5       6 Scientist
    # 4      John  31    180  Student    <NA>      <NA>
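As a rough illustration of why Reduce() works here: it folds a two-argument function over a list from left to right, so the call is equivalent to nesting merge() by hand. The data frames `d1`, `d2`, `d3` and the helper `outer_merge` below are invented for this sketch and are not part of the question's data.

```r
# Three small made-up data frames sharing only the key column "id".
d1 <- data.frame(id = 1:3, x = c("p", "q", "r"), stringsAsFactors = FALSE)
d2 <- data.frame(id = 2:3, y = c("s", "t"), stringsAsFactors = FALSE)
d3 <- data.frame(id = 3L, z = "u", stringsAsFactors = FALSE)

# Hypothetical helper: a full outer join on the shared columns.
outer_merge <- function(x, y) merge(x, y, all = TRUE)

# Reduce() applies outer_merge to (d1, d2), then to that result and d3.
folded <- Reduce(outer_merge, list(d1, d2, d3))

# The same thing written out by hand.
nested <- outer_merge(outer_merge(d1, d2), d3)

print(folded)
print(identical(folded, nested))
```

The benefit of the Reduce() form is that the list can hold any number of data frames without changing the code.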

Use reshape() to change your data from wide to long.

    df.all <- reshape(temp, direction = "long", idvar="name", varying=4:6, sep="")
    df.all
    #                  name age height time     group
    # Andy.1           Andy  48    168    1  Employer
    # Anne.1           Anne  26    175    1         3
    # Christine.1 Christine  54    160    1         5
    # John.1           John  31    180    1   Student
    # Andy.2           Andy  48    168    2      <NA>
    # Anne.2           Anne  26    175    2   Teacher
    # Christine.2 Christine  54    160    2         6
    # John.2           John  31    180    2      <NA>
    # Andy.3           Andy  48    168    3      <NA>
    # Anne.3           Anne  26    175    3      <NA>
    # Christine.3 Christine  54    160    3 Scientist
    # John.3           John  31    180    3      <NA>

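Take advantage of the fact that as.numeric() coerces non-numeric character strings to NA (with a warning), so is.na(as.numeric(...)) keeps only the rows whose group is not a number. Then use na.omit() to drop the remaining rows that contain NA values.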

    na.omit(df.all[is.na(as.numeric(df.all$group)), ])
    #                  name age height time     group
    # Andy.1           Andy  48    168    1  Employer
    # John.1           John  31    180    1   Student
    # Anne.2           Anne  26    175    2   Teacher
    # Christine.3 Christine  54    160    3 Scientist
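The coercion trick can be checked in isolation. This is a small sketch on a hand-written vector (`group_values` and `as_num` are illustrative names); suppressWarnings() is added here only to silence the expected "NAs introduced by coercion" warning.

```r
# A mix of text labels, numeric codes stored as text, and a real missing value.
group_values <- c("Employer", "3", "Teacher", "6", NA)

# Text labels cannot be parsed as numbers, so they become NA;
# the numeric codes survive, and the original NA stays NA.
as_num <- suppressWarnings(as.numeric(group_values))
print(as_num)

# TRUE marks entries that are either text labels or were missing to begin with.
print(is.na(as_num))
```

Because an entry that was already missing is also flagged TRUE, the is.na() filter alone cannot tell labels from gaps, which is why a second step that removes NA rows is still needed.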

Again, this may be an overgeneralization of your problem. For example, if there are NA values in other columns, na.omit() will drop those rows as well. Still, it may help you solve the problem.
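If NA values in other columns are a concern, one possible variation is to test only the group column instead of calling na.omit() on the whole data frame. The sketch below uses a made-up long-format data frame (`long`, `is_label` and `result` are hypothetical names) in which Andy's age is missing.

```r
# Made-up long-format data; Andy has a missing age on purpose.
long <- data.frame(
  name  = c("Andy", "Anne", "Anne", "Andy"),
  age   = c(NA, 26, 26, NA),
  group = c("Employer", "3", "Teacher", NA),
  stringsAsFactors = FALSE
)

# Keep rows whose group is present AND is not a number stored as text.
is_label <- !is.na(long$group) &
  is.na(suppressWarnings(as.numeric(long$group)))
result <- long[is_label, ]
print(result)

# For comparison: na.omit() on the whole frame also drops Andy,
# because his age is NA.
strict <- na.omit(long[is.na(suppressWarnings(as.numeric(long$group))), ])
print(strict)
```

Whether a row with a missing age should be kept is a decision about the data, so treat this as one option rather than a replacement for the approach above.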
