New answer. I had some fun with this (and some frustration). I am sure this is not the fastest solution, but it should get you past the point where my other answer stopped working. Let me explain:

    library(data.table)

    dat <- data.table(
      a = c('cat', 'canine', 'feline', 'dog', 'cat', 'fido'),
      b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
      c = c('kit', 'barker', 'kitty', 'canine', 'feline', 'wolf'),
      d = c('shorthair', 'collie', '', '', '', ''),
      e = c(1, 2, 3, 4, 5, 6)
    )
    # All holds the a, b and c strings of each row, joined with spaces
    dat[, All := paste(a, b, c)]

Two changes: dat$e is now an index column, so it is simply the numeric position of each row. If e matters for other reasons, you can create a new column to replace it.
Below is the first loop. It creates 3 new columns, FirstMatchingID and so on. As before, they give the index of the earliest row (lowest row number) whose dat$All matches a, b and c respectively.
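
    for (i in 2:nrow(dat)) {
      # dat[i - (1:i)] is the rows above i, nearest first (index 0 is dropped),
      # so the largest matching position is the earliest matching row.
      # With no match, max() warns and the ID ends up NA.
      x <- grepl(dat[i]$a, dat[i - (1:i)]$All)
      y <- max(which(x %in% TRUE))
      dat[i, FirstMatchingID := dat[i - y]$e]

      x2 <- grepl(dat[i]$b, dat[i - (1:i)]$All)
      y2 <- max(which(x2 %in% TRUE))
      dat[i, SecondMatchingID := dat[i - y2]$e]

      x3 <- grepl(dat[i]$c, dat[i - (1:i)]$All)
      y3 <- max(which(x3 %in% TRUE))
      dat[i, ThirdMatchingID := dat[i - y3]$e]
    }
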
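To make the index arithmetic in that loop easier to follow, here is a minimal base-R sketch of a single lookup. The vector all_strings and the other variable names are illustrative stand-ins for dat$All and friends; they are not part of the code above.

```r
# Stand-in for dat$All (first five rows of the example data).
all_strings <- c("cat feline kit", "canine puppy barker",
                 "feline meower kitty", "dog wolf canine",
                 "cat kitten feline")

i <- 5                      # the row we are looking backwards from
back <- i - (1:i)           # 4 3 2 1 0: earlier rows, nearest first
# Indexing with 0 selects nothing, so only rows 4..1 are searched.
x <- grepl("feline", all_strings[back])
y <- max(which(x))          # largest offset = match furthest back
earliest <- i - y           # convert the offset back to a row number
```

Here "feline" occurs in rows 3 and 1, and the sketch picks row 1, the earliest of the two.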
Then we use pmin to find the earliest matching row across the MatchingID columns and store it in a column of its own. So if you have a match for a on row 25 and a match for b on row 12, it will give you 12 (I assume this is what you want, based on your question).

    dat$MinID <- pmin(dat$FirstMatchingID, dat$SecondMatchingID,
                      dat$ThirdMatchingID, na.rm = TRUE)

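As a small illustration of how pmin behaves with na.rm, here is a sketch on made-up ID vectors. The names first_id, second_id and third_id and their values are hypothetical, chosen only to show the row-wise minimum and the all-NA case.

```r
# Hypothetical MatchingID columns; NA means "no earlier match".
first_id  <- c(NA, NA, 1, NA)
second_id <- c(NA, 12, NA, NA)
third_id  <- c(NA, 25, 3, 2)

# na.rm = TRUE ignores NAs row by row; a row stays NA only when
# every input is NA for that row.
min_id <- pmin(first_id, second_id, third_id, na.rm = TRUE)
```

The second element is the 25-versus-12 case from the text: the smaller row number, 12, is kept.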
Finally, this loop does 3 things, creating a FinalID column with all the matching ID numbers from e:

- Where MinID is NA (no matches), set FinalID to e.
- If MinID is a number, look up that row (the earliest match) and check whether its own MinID is a number; if it is not, that row has no earlier match, so FinalID is set to MinID.
- Rows that meet neither condition are your special cases, where the earliest match of row i itself has an earlier match. The loop finds that earlier match and sets it as FinalID.
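
    for (i in 1:nrow(dat)) {
      x <- dat[i]$MinID                      # row of the earliest match, or NA
      if (is.na(dat[i]$MinID)) {
        dat[i, FinalID := e]                 # no match: keep the row's own ID
      } else if (is.na(dat[x]$MinID)) {
        dat[i, FinalID := MinID]             # the match has no earlier match
      } else {
        dat[i, FinalID := dat[x]$MinID]      # the match has an earlier match
      }
    }
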
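If the loop turns out to be too slow, the same three rules can be written without a loop. This is a base-R sketch of an alternative, not the code above; e and min_id are plain vectors standing in for dat$e and dat$MinID, and the min_id values are hand-picked for illustration.

```r
e      <- c(1, 2, 3, 4, 5, 6)
min_id <- c(NA, NA, 1, 2, 1, 4)   # hand-picked, illustrative values

# MinID of the row each MinID points at; an NA index yields NA.
parent <- min_id[min_id]

final_id <- ifelse(is.na(min_id), e,        # no match: own ID
            ifelse(is.na(parent), min_id,   # match has no earlier match
                   parent))                 # match has an earlier match
```

Like the loop, this follows the chain of matches by one step only. In the sketch, row 6 points at row 4, which in turn points at row 2, so row 6 ends up with 2.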
I think this should do it; let me know how it goes. I make no claims about its efficiency or speed.