R: combine data frames, allowing inexact ID matching (for example, with extra characters, so that 1234 matches ab1234)

I am trying to deal with some very dirty data. I need to combine two large data frames that contain different kinds of data, matching them by sample ID. The problem is that the sample identifiers in one table come in many different formats, but most of them contain the required identifier string somewhere in their ID. For example, sample "1234" in one table has the identifier "ProjectB (1234)" in the other.

I made a minimal reproducible example.

    a <- data.frame(aID=c("1234","4567","6789","3645"),
                    aInfo=c("blue","green","goldenrod","cerulean"))
    b <- data.frame(bID=c("4567","(1234)","6789","23645","63528973"),
                    bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

Using merge gets me part of the way:

    merge(a, b, by.x="aID", by.y="bID", all=TRUE)
           aID     aInfo       bInfo
    1     1234      blue        <NA>
    2     3645  cerulean        <NA>
    3     4567     green       apple
    4     6789 goldenrod        kiwi
    5   (1234)      <NA>      banana
    6    23645      <NA> pomegranate
    7 63528973      <NA>      lychee

but the result I would like is basically:

            ID     aInfo       bInfo
    1     1234      blue      banana
    2     3645  cerulean pomegranate
    3     4567     green       apple
    4     6789 goldenrod        kiwi
    5 63528973      <NA>      lychee

I was just wondering whether there is a way to incorporate grep into this, or some other R-tastic method?

Thanks in advance

3 answers

Merging on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function and apply it with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do an "inner join", where only rows that match are returned.

    # I slightly modified your data to test multiple matches
    a <- data.frame(aID=c("1234","1234","4567","6789","3645"),
                    aInfo=c("blue","blue2","green","goldenrod","cerulean"))
    b <- data.frame(bID=c("4567","(1234)","6789","23645","63528973"),
                    bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

    f <- function(x) merge(x, b[agrep(x$aID[1], b$bID), ], all=TRUE)
    do.call(rbind, by(a, a$aID, f))
    #         aID     aInfo    bID       bInfo
    # 1234.1 1234      blue (1234)      banana
    # 1234.2 1234     blue2 (1234)      banana
    # 3645   3645  cerulean  23645 pomegranate
    # 4567   4567     green   4567       apple
    # 6789   6789 goldenrod   6789        kiwi
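Since the join above depends entirely on what agrep decides is a match, it can help to look at the candidate matches for each ID before merging. This is only a rough sketch using the example IDs from the question; the helper name `candidates` is made up for illustration, and the default tolerance may behave differently on real data.

```r
# Example IDs from the question
a_ids <- c("1234", "4567", "6789", "3645")
b_ids <- c("4567", "(1234)", "6789", "23645", "63528973")

# Hypothetical helper: which bIDs does agrep consider a match for one aID?
# value=TRUE returns the matching strings instead of their positions.
candidates <- function(id, pool, ...) agrep(id, pool, value = TRUE, ...)

# Default tolerance (max.distance = 0.1, a fraction of the pattern length)
fuzzy <- lapply(a_ids, candidates, pool = b_ids)
names(fuzzy) <- a_ids

# max.distance = 0 allows no edits, so the ID must appear verbatim
# somewhere inside the bID
strict <- lapply(a_ids, candidates, pool = b_ids, max.distance = 0)
names(strict) <- a_ids

print(fuzzy)
print(strict)
```

If the fuzzy and strict lists disagree for an ID, that ID is probably worth checking by hand before trusting the merged rows.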

Doing a full join is a little trickier. Here is one way, which is still inefficient:

    f <- function(x, b) {
      matches <- b[agrep(x[1,1], b[,1]), ]
      if (nrow(matches) > 0) merge(x, matches, all=TRUE)
      # Ugly... but how else to create a data.frame full of NAs?
      else merge(x, b[NA,][1,], all.x=TRUE)
    }

    d <- do.call(rbind, by(a, a$aID, f, b))
    left.over <- !(b$bID %in% d$bID)
    # Note: passing 'bID' as the index only works because a single row is left over here
    rbind(d, do.call(rbind, by(b[left.over,], 'bID', f, a))[names(d)])
    #         aID     aInfo      bID       bInfo
    # 1234.1 1234      blue   (1234)      banana
    # 1234.2 1234     blue2   (1234)      banana
    # 3645   3645  cerulean    23645 pomegranate
    # 4567   4567     green     4567       apple
    # 6789   6789 goldenrod     6789        kiwi
    # bID    <NA>      <NA> 63528973      lychee

I would clean your identifiers before the merge. If you know all the odd ways the bIDs can be formatted, you can clean them up with gsub().

In your example, to remove the brackets, I would do something like

    expr <- '\\((.*)\\)'
    b$bID <- gsub(expr, replacement='\\1', b$bID)

There are a few things going on in expr. First, there is .*, which is the regular expression for any character, any number of times. Wrapping it in parentheses tells gsub that we want to keep that part and lets us refer to it in the replacement. To use the left and right parentheses as literal characters, we need to escape them with a double backslash. Put together, this reads: I want to keep everything between the left parenthesis and the right parenthesis.
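To make the behaviour of that pattern concrete, here is a small sketch on a few sample strings, including the "ProjectB (1234)" style from the question. The second pattern is my own variation, not part of the original suggestion: it also matches the text around the parentheses, so that only the captured part survives.

```r
ids <- c("(1234)", "ProjectB (1234)", "4567")

# The pattern from above: only the matched "(...)" part is rewritten,
# so any text outside the parentheses is left in place.
stripped <- gsub('\\((.*)\\)', '\\1', ids)
print(stripped)
# "1234" "ProjectB 1234" "4567"

# Variation (an assumption about what is wanted): consume the whole string
# and keep only the capture group. Strings without parentheses do not match
# and are returned unchanged.
extracted <- sub('^.*\\((.*)\\).*$', '\\1', ids)
print(extracted)
# "1234" "1234" "4567"
```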

Note that you can do fancy things with the replacement string, for example replacement='id_\\1'.
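Putting the cleaning step and the merge together on the question's example data might look roughly like this. It is a sketch, not a complete solution: the extra column name `cleanID` is made up so that the original bID is kept for inspection, and the 3645/23645 pair is deliberately left unmatched because stripping parentheses does nothing for it.

```r
a <- data.frame(aID = c("1234", "4567", "6789", "3645"),
                aInfo = c("blue", "green", "goldenrod", "cerulean"),
                stringsAsFactors = FALSE)
b <- data.frame(bID = c("4567", "(1234)", "6789", "23645", "63528973"),
                bInfo = c("apple", "banana", "kiwi", "pomegranate", "lychee"),
                stringsAsFactors = FALSE)

# Clean into a new (hypothetical) column rather than overwriting bID
b$cleanID <- gsub('\\((.*)\\)', '\\1', b$bID)

# Full outer join on the cleaned identifier
merged <- merge(a, b, by.x = "aID", by.y = "cleanID", all = TRUE)
print(merged)
```

Here "1234" now pairs with banana, while "3645" and "23645" still end up on separate rows.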

As for finding an identifier inside a longer run of digits (such as 3645 inside 23645), you would have to try substring matching or something similar, but I do not consider that a good approach.
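For what it's worth, a substring lookup could be sketched as below, which also shows why it is risky: short IDs can turn up inside unrelated longer ones. The ID "45" is invented purely to illustrate the ambiguity and is not in the question's data.

```r
a_ids <- c("1234", "4567", "6789", "3645")
b_ids <- c("4567", "(1234)", "6789", "23645", "63528973")

# For each aID, list every bID that contains it as a literal substring
# (fixed = TRUE switches off regex interpretation of the ID)
hits <- lapply(a_ids, function(id) b_ids[grepl(id, b_ids, fixed = TRUE)])
names(hits) <- a_ids

# IDs that match more than one bID would need manual resolution
ambiguous <- names(hits)[lengths(hits) > 1]
print(hits)
print(ambiguous)

# A made-up short ID shows the problem: "45" occurs in two unrelated bIDs
short_hits <- b_ids[grepl("45", b_ids, fixed = TRUE)]
print(short_hits)
# "4567" "23645"
```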

Hope this helps.


This is an answer using data.table , inspired by @nograpes.

    library(data.table)

    ## Create example tables; I added the sarcoline cases
    ## so there would be examples of rows in a but not b
    a <- data.table(aID=c("1234","1234","4567","6789","3645","321","321"),
                    aInfo=c("blue","blue2","green","goldenrod","cerulean",
                            "sarcoline","sarcoline2"),
                    key="aID")
    b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                    bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                    key="bID")

    ## Use agrep to get the rows of b by each aID from a
    ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
    ab
    ##     aID     aInfo    bID       bInfo
    ## 1: 1234      blue (1234)      banana
    ## 2: 1234     blue2 (1234)      banana
    ## 3: 3645  cerulean  23645 pomegranate
    ## 4: 4567     green   4567       apple
    ## 5: 6789 goldenrod   6789        kiwi

So far we only have an inner join, so now add the unmatched rows from the original tables:

    ab <- rbindlist(list(ab,
                         a[!ab[, unique(aID)]],
                         b[!ab[, unique(bID)]]),
                    fill=TRUE)

These steps are optional and are included only to match the OP's desired output:

    ## Update NA values of aID with the value from bID
    ab[is.na(aID), aID:=bID]
    ## Drop the bID column
    ab[, bID:=NULL]

Final result

    ab
    ##         aID      aInfo       bInfo
    ## 1:     1234       blue      banana
    ## 2:     1234      blue2      banana
    ## 3:     3645   cerulean pomegranate
    ## 4:     4567      green       apple
    ## 5:     6789  goldenrod        kiwi
    ## 6:      321  sarcoline          NA
    ## 7:      321 sarcoline2          NA
    ## 8: 63528973         NA      lychee