Merge two datasets and create new rows that match

I have two datasets that I would like to join using R -

Dataset 1

    ID Name Date Price
    1  A    2011 $100
    2  B    2012 $200
    3  C    2013 $300

Dataset 2

    ID Date Price
    1  2012 $100
    1  2013 $200
    3  2014 $300

Using left_join() in dplyr by ID, I would end up with this

    ID Name Date.x Price.x Date.y Price.y
    1  A    2011   $100    2012   $100
    1  A    2011   $100    2013   $200
    2  B    2012   $200
    3  C    2013   $300    2014   $300

What I would like to have as a final product is

    ID Name Date Price
    1  A    2011 $100
    1  A    2012 $100
    1  A    2013 $200
    2  B    2012 $200
    3  C    2013 $300
    3  C    2014 $300

That is, instead of merging with an existing row, I would like to create a new row when a match is found, duplicating the information that does not change (ID and Name) and, where necessary, changing the Date and Price columns. Any ideas on an efficient way to do this on a large data set?

+6
7 answers

You asked about the efficient way, so I will introduce data.table:

    library(data.table)
    setDT(DF1)
    setDT(DF2)

    # structure your data so ID attributes are only in an ID table
    idDT = DF1[, .(ID, Name)]
    DF1[, Name := NULL]

    # stack data
    DT = rbind(DF1, DF2)

    # grab ID attributes if you really need them
    DT[idDT, on="ID", Name := i.Name]

which gives

       ID Date Price Name
    1:  1 2011  $100    A
    2:  2 2012  $200    B
    3:  3 2013  $300    C
    4:  1 2012  $100    A
    5:  1 2013  $200    A
    6:  3 2014  $300    C

rbind for data.tables is pretty fast. I would not expect efficiency to be a big problem when just two tables are being stacked.

As for splitting the ID attribute, Name, off into its own table: this follows the recommendations of the author of the dplyr package, who refers to it as making the data "tidy".
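If you would rather not drop Name before stacking, a variant is sketched below. It assumes data.table is installed and uses made-up tables d1 and d2 that mirror the question's data (the names are mine, not the asker's). rbind with fill = TRUE pads the missing Name column with NA, and an update join then fills it in:

```r
library(data.table)

# Illustrative stand-ins for the two datasets in the question
d1 <- data.table(ID = 1:3, Name = c("A", "B", "C"),
                 Date = 2011:2013, Price = c("$100", "$200", "$300"))
d2 <- data.table(ID = c(1L, 1L, 3L),
                 Date = 2012:2014, Price = c("$100", "$200", "$300"))

# Stack the tables; rows from d2 get Name = NA because d2 has no Name column
stacked <- rbind(d1, d2, fill = TRUE)

# Update join: copy Name from d1 into every row of stacked with a matching ID
stacked[d1, on = "ID", Name := i.Name]
```

This keeps d1 intact, at the cost of carrying the repeated Name values in the stacked table.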

+6

This is a small change to @Frank's answer. The main problem is that your second table does not have a Name column. It can be added quite efficiently using data.table's update on join.

    require(data.table)
    dt2[dt1, Name := i.Name, on = "ID"] # by reference, no need to assign the result back

Now that there is a Name column, we can just rbind to get the result.

    # drop rows of dt2 whose Name is still NA (IDs with no match in dt1)
    ans = rbind(dt1, if (anyNA(dt2$Name)) na.omit(dt2, cols="Name") else dt2)

If necessary, reorder the result by reference using setorder() :

    setorder(ans, ID, Name) # by reference, no need to assign the result back
    #    ID Name Date Price
    # 1:  1    A 2011  $100
    # 2:  1    A 2012  $100
    # 3:  1    A 2013  $200
    # 4:  2    B 2012  $200
    # 5:  3    C 2013  $300
    # 6:  3    C 2014  $300

:= and set* functions in data.table change the input object by reference.


    dt1 = fread('ID Name Date Price
    1 A 2011 $100
    2 B 2012 $200
    3 C 2013 $300')
    dt2 = fread('ID Date Price
    1 2012 $100
    1 2013 $200
    3 2014 $300')
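To make the "by reference" point concrete, here is a minimal sketch, assuming data.table is installed. The table dt and the Flag column are made up for illustration. A plain assignment does not copy a data.table, so a later := is visible through both names, while copy() gives an independent table:

```r
library(data.table)

dt <- data.table(ID = 1:3, Name = c("A", "B", "C"))
alias    <- dt        # plain assignment: both names refer to the same table
snapshot <- copy(dt)  # copy() makes an independent copy

dt[, Flag := ID > 1]  # adds a column in place, no assignment needed

by_ref_visible <- "Flag" %in% names(alias)        # alias sees the new column
copy_untouched <- !("Flag" %in% names(snapshot))  # the copy does not
```

So if the original dt2 must stay unchanged, take a copy() before the update join.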
+4
    df1 <- data.frame(
      ID=1:3,
      Name=c("A","B","C"),
      Date=c(2011,2012,2013),
      Price=c(100,200,300)
    )
    df2 <- data.frame(
      ID=c(1,1,3),
      Date=c(2012,2013,2014),
      Price=c(100,200,300)
    )

left_join will not give the desired result. You can use full_join instead.

    library(dplyr)
    merged <- full_join(df1, df2, by=c("Date","ID"))

You can then reshape the output with melt from the reshape2 package:

    library(reshape2)
    merged <- melt(merged, id.vars=c("ID","Name","Date"))

Then:

    > merged[na.omit(merged$Name), -4] # remove NAs and column from melt
        ID Name Date value
    1    1    A 2011   100
    2    2    B 2012   200
    3    3    C 2013   300
    1.1  1    A 2011   100
    2.1  2    B 2012   200
    3.1  3    C 2013   300
+1

Use an inner join with nomatch = 0. For example, if some ID in dataset 2 is 4 (which has no match in dataset 1), the inner join will not produce NA for the non-matching identifiers. If you remove nomatch = 0, NAs will be created.

EDIT: rbindlist wrapper added as per @Arun suggestion

    library("data.table")
    rbindlist(list(df1,
                   setDT(df1)[i = df2,
                              j = .(ID, Name, Date = i.Date, Price = i.Price),
                              on = .(ID),
                              nomatch = 0]))

Output:

       ID Name Date Price
    1:  1    A 2011  $100
    2:  2    B 2012  $200
    3:  3    C 2013  $300
    4:  1    A 2012  $100
    5:  1    A 2013  $200
    6:  3    C 2014  $300
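To see what nomatch = 0 changes, here is a minimal sketch on made-up tables, assuming data.table is installed. The tables x and i and the row with ID 4 are illustrative and are not the asker's data:

```r
library(data.table)

x <- data.table(ID = 1:3, Name = c("A", "B", "C"))
i <- data.table(ID = c(1L, 4L), Date = c(2012L, 2015L))  # ID 4 has no match in x

inner   <- x[i, on = .(ID), nomatch = 0]  # unmatched row for ID 4 is dropped
with_na <- x[i, on = .(ID)]               # default keeps ID 4 with Name = NA
```

Here inner has one row (ID 1), while with_na has two rows, the second with a missing Name.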
+1

Perhaps one effective way to do this is to merge in two steps.

    # Create Dataset 1
    ID <- 1:3
    Name <- c("A", "B", "C")
    Date <- 2011:2013
    Price <- c("$100", "$200", "$300")
    dataset1 <- data.frame(ID, Name, Date, Price)

    # Create Dataset 2
    ID <- c(1,1,3)
    Date <- 2012:2014
    Price <- c("$100", "$200", "$300")
    dataset2 <- data.frame(ID, Date, Price)

Fill in the missing "Name" values in Dataset 2 using the merge function from base R:

    dataset2 <- merge(dataset1[c("ID", "Name")], dataset2)

Then merge the two data sets:

    merge(dataset1, dataset2, all = TRUE)

which gives:

      ID Name Date Price
    1  1    A 2011  $100
    2  1    A 2012  $100
    3  1    A 2013  $200
    4  2    B 2012  $200
    5  3    C 2013  $300
    6  3    C 2014  $300
+1

You can use join from plyr to get the names for the second data frame, and then rbind to stack the rows.

    library(plyr)

    ## Add the name column to df2 and get rid of unwanted columns
    df3 <- join(df2, df1, by = "ID")
    df3[,6] <- NULL
    df3[,5] <- NULL
    combined <- rbind(df1, df3)
0
    > dsa
      ID Name Date Price
    1  1    A 2011  $100
    2  2    B 2012  $200
    3  3    C 2013  $300
    > dsb
      ID Date Price
    1  1 2012  $100
    2  1 2013  $200
    3  3 2014  $300
    > dsb$Name <- NA
    > dsr <- rbind(dsa, dsb)
    > dsr$Name <- dsa$Name[match(dsr$ID, dsa$ID)]
    > dsr
      ID Name Date Price
    1  1    A 2011  $100
    2  2    B 2012  $200
    3  3    C 2013  $300
    4  1    A 2012  $100
    5  1    A 2013  $200
    6  3    C 2014  $300

I am new to R, so this may not use R's full potential for efficiency, but it does the job.
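The key step above is the match() lookup. A minimal base R sketch of how it behaves follows. The vector ids and the table lookup are made up for illustration, and ID 5 is deliberately absent to show what an unmatched ID produces:

```r
# Hypothetical lookup table and IDs; 5 has no entry in the lookup table
lookup <- data.frame(ID = 1:3, Name = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
ids <- c(1, 1, 3, 5)

pos <- match(ids, lookup$ID)     # position of each id in lookup$ID, NA if absent
names_for_ids <- lookup$Name[pos]  # indexing with NA yields NA
```

An ID with no match ends up with Name = NA rather than raising an error, so it is worth checking anyNA() on the result.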

0