Multiple column data frame row join in R

Question

I have a data frame in R that has one person per row. Sometimes a person appears on two rows, and I would like to combine these rows based on a duplicate identifier.

The problem is that each person has several identifiers, and when an identifier appears twice, it does not necessarily appear in the same column.

Here is an example data frame:

    dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                      b = c('feline', 'puppy', 'meower', 'wolf'),
                      c = c('kitten', 'barker', 'kitty', 'canine'),
                      d = c('shorthair', 'collie', '', ''),
                      e = c(1, 5, 3, 8))

    > dat
           a      b      c         d e
    1    cat feline kitten shorthair 1
    2 canine  puppy barker    collie 5
    3 feline meower  kitty           3
    4    dog   wolf canine           8

So rows 1 and 3 should be combined, because identifier b in row 1 equals identifier a in row 3. Similarly, identifier a in row 2 equals identifier c in row 4, so those rows should also be combined.

Ideally, the output would look like this:

         a.1    b.1    c.1       d.1 e.1    a.2    b.2    c.2 d.2 e.2
    1    cat feline kitten shorthair   1 feline meower  kitty       3
    2 canine  puppy barker    collie   5    dog   wolf canine       8

(Note that rows were not combined on the basis of shared identifiers that are empty strings.)

My thoughts on how to do this are below, but I’m fairly sure I started down the wrong path, so they may not help in solving the problem.

I thought I could assign a row identifier to each row and then melt the data. After that, I could go through it row by row. When I found a row in which one of the identifiers matches an earlier row (for example, when one of row 3's identifiers matches one of row 1's identifiers), I would change every instance of the current row's identifier to the earlier row's identifier (for example, every row.id of 3 would be changed to 1).

Here is the code I used:

    dat$row.id <- 1:nrow(dat)
    library(reshape2)
    dat.melt <- melt(dat, id.vars = c('e', 'row.id'))

    for (i in 2:nrow(dat.melt)) {
      # This next step is just to ignore the empty values
      if (grepl('^[[:space:]]*$', dat.melt$value[i])) {
        next
      }
      earlier.instance <- dat.melt$row.id[which(dat.melt$value[1:(i-1)] == dat.melt$value[i])]
      if (length(earlier.instance) > 0) {
        earlier.row.id <- earlier.instance[1]
        dat.melt$row.id[dat.melt$row.id == dat.melt$row.id[i]] <- earlier.row.id
      }
    }

There are two problems with this approach.

1. An identifier in row 3 might match row 1, and another identifier in row 5 might match row 3. In that case, the row identifiers for both rows 3 and 5 should be changed to 1. This means it is important to go through the rows sequentially, which is why I used a for loop rather than an apply function. I know this is not very R-like, and it is very slow on the large data frame I am working with.
2. This code produces the output below. There are now several rows with the same row.id and variable, so I don’t know how to cast this to get the layout I showed above. Using dcast here would force an aggregation function.

Output:

       e row.id variable     value
    1  1      3        a       cat
    2  5      2        a    canine
    3  3      3        a    feline
    4  8      2        a       dog
    5  1      3        b    feline
    6  5      2        b     puppy
    7  3      3        b    meower
    8  8      2        b      wolf
    9  1      3        c    kitten
    10 5      2        c    barker
    11 3      3        c     kitty
    12 8      2        c    canine
    13 1      3        d shorthair
    14 5      2        d    collie
    15 3      3        d
    16 8      2        d
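
The chaining case described above (row 5 matching row 3, which in turn matches row 1) can be handled without relabelling rows one at a time. What follows is only a minimal sketch in base R, not a tested solution for large data. The helper name group_rows and its arguments are hypothetical, and it assumes that identifiers match only when they are exactly equal and that empty or missing identifiers never link rows. It repeatedly gives each row the smallest label held by any row it shares an identifier with, until nothing changes.

```r
# Hypothetical helper: returns one group label per row of df. Rows that share
# a non-empty identifier, directly or through a chain, get the same label.
group_rows <- function(df, id_cols) {
  # Long form: one entry per (row, identifier value), stacked column by column
  long <- data.frame(
    row = rep(seq_len(nrow(df)), times = length(id_cols)),
    value = unlist(lapply(df[id_cols], as.character), use.names = FALSE),
    stringsAsFactors = FALSE
  )
  # Assumption: missing, empty or whitespace-only identifiers never link rows
  long <- long[!is.na(long$value) & !grepl('^[[:space:]]*$', long$value), ]

  group <- seq_len(nrow(df))
  repeat {
    # Smallest label among all rows carrying the same identifier value
    by_value <- ave(group[long$row], long$value, FUN = min)
    # Smallest label offered to each row by any of its identifiers
    by_row <- ave(by_value, long$row, FUN = min)
    new_group <- group
    new_group[long$row] <- by_row
    if (all(new_group == group)) break
    group <- new_group
  }
  group
}

dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

groups <- group_rows(dat, c('a', 'b', 'c', 'd'))
```

On the example data this puts rows 1 and 3 in group 1 and rows 2 and 4 in group 2.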

r dataframe reshape2

njc Sep 13 '16 at 14:10

2 answers

Here is an amateur attempt. I think it does what you need. I extended the data.frame (now a data.table) by two rows to give a better example.

This loop creates a new dat$FirstMatchingID column that contains the identifier from dat$e for the earliest match. I did this only for the first column, dat$a, but I think it could easily be extended to b and c.

    library(data.table)

    dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'feline', 'puppy'),
                      b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                      c = c('kitten', 'barker', 'kitty', 'canine', 'cat', 'wolf'),
                      d = c('shorthair', 'collie', '', '', '', ''),
                      e = c(1, 5, 3, 8, 4, 6))
    dat[, All := paste(a, b, c), ]

    for (i in 2:nrow(dat)) {
      print(dat[i])
      # x[k] is TRUE when the row k places above row i contains this row's a
      x <- grepl(dat[i]$a, dat[i-(1:i)]$All)
      # The largest matching offset points at the earliest matching row
      y <- max(which(x %in% TRUE))
      dat[i, FirstMatchingID := dat[i-y]$e]
    }

Result:

            a      b      c         d e                 All FirstMatchingID
    1:    cat feline kitten shorthair 1   cat feline kitten              NA
    2: canine  puppy barker    collie 5 canine puppy barker              NA
    3: feline meower  kitty           3 feline meower kitty               1
    4:    dog   wolf canine           8     dog wolf canine              NA
    5: feline kitten    cat           4   feline kitten cat               1
    6:  puppy    dog   wolf           6      puppy dog wolf               5
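
One thing to keep in mind with the loop above: grepl treats its first argument as a regular expression and matches it anywhere in the string, so a short identifier can match inside a longer one. A minimal sketch of the difference, using illustrative identifiers rather than the table above, with %in% shown as one possible exact-match alternative:

```r
# Illustrative identifiers of some earlier row (not taken from the result above)
earlier_ids <- c('feline', 'meower', 'kitty')
earlier_all <- paste(earlier_ids, collapse = ' ')

# Pattern match anywhere in the pasted string: 'kit' is found inside 'kitty'
substring_hit <- grepl('kit', earlier_all)

# Whole-identifier comparison: 'kit' is not one of the identifiers
exact_hit <- 'kit' %in% earlier_ids
```

Here substring_hit is TRUE and exact_hit is FALSE, so which one is right depends on whether partial matches should count as the same person.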

You will still need to figure out how to combine the rows to get the desired result, but hopefully this helps!
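
One possible way to do that last step, sketched in base R under some assumptions: a grouping vector is already known (the hypothetical group below, one label per row), and the question's original four-row data is used rather than the extended table. Each group's rows are laid side by side, with a numeric suffix added to the column names.

```r
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

# Assumed grouping: rows 1 and 3 belong together, as do rows 2 and 4
group <- c(1, 2, 1, 2)

parts <- split(dat, group)
max_n <- max(sapply(parts, nrow))   # widest group decides the number of blocks

wide <- do.call(rbind, lapply(parts, function(p) {
  pieces <- lapply(seq_len(max_n), function(k) {
    # Take the k-th row of the group, or an all-NA row if the group is shorter
    piece <- if (k <= nrow(p)) p[k, , drop = FALSE] else p[NA_integer_, , drop = FALSE]
    names(piece) <- paste(names(p), k, sep = '.')
    rownames(piece) <- NULL
    piece
  })
  do.call(cbind, pieces)
}))
```

The result has one row per group and columns a.1 to e.1 followed by a.2 to e.2.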

moman822 Sep 13 '16 at 17:25

moman822 · Accepted Answer · 2016-09-14T01:48:46+0000

New answer. I had some fun (and some frustration) with this. I am sure it is not the fastest solution, but it should take you on from where my other answer stopped. Let me explain:

    library(data.table)

    dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'cat', 'fido'),
                      b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                      c = c('kit', 'barker', 'kitty', 'canine', 'feline', 'wolf'),
                      d = c('shorthair', 'collie', '', '', '', ''),
                      e = c(1, 2, 3, 4, 5, 6))
    dat[, All := paste(a, b, c), ]

Changes from before: dat$e is now an index column, so it is just the numeric position of each row. If e matters for other reasons, you can create a new column for the index instead.

Below is the first loop. It creates three new columns, FirstMatchingID and so on. The idea is the same as before: they give the index of the earliest (lowest row number) match in dat$All for a, b and c respectively.

    for (i in 2:nrow(dat)) {
      # Earliest earlier row whose All contains this row's a
      x <- grepl(dat[i]$a, dat[i-(1:i)]$All)
      y <- max(which(x %in% TRUE))
      dat[i, FirstMatchingID := dat[i-y]$e]

      # Same for b
      x2 <- grepl(dat[i]$b, dat[i-(1:i)]$All)
      y2 <- max(which(x2 %in% TRUE))
      dat[i, SecondMatchingID := dat[i-y2]$e]

      # Same for c
      x3 <- grepl(dat[i]$c, dat[i-(1:i)]$All)
      y3 <- max(which(x3 %in% TRUE))
      dat[i, ThirdMatchingID := dat[i-y3]$e]
    }

Then we use pmin to find the earliest matching row across the MatchingID columns and store it in its own column. This covers the case where a has a match on row 25 and b has a match on row 12; it will give you 12 (which I assume is what you want, based on your question).

    dat$MinID <- pmin(dat$FirstMatchingID, dat$SecondMatchingID, dat$ThirdMatchingID, na.rm = TRUE)
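
As a small illustration of what pmin does with na.rm = TRUE (the three vectors are made up for the example and are not taken from the table above): NA entries are skipped position by position, and a position stays NA only when all three inputs are NA there.

```r
# Made-up match indices for four rows; NA means "no earlier match in this column"
first  <- c(NA, NA, 1, NA)
second <- c(NA, 2, 3, NA)
third  <- c(NA, NA, 2, 1)

# Element-wise minimum, ignoring NA unless every input is NA at that position
min_id <- pmin(first, second, third, na.rm = TRUE)
```

Here min_id is NA, 2, 1, 1.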

Finally, this loop does three things, creating a FinalID column with all the corresponding ID numbers from e:

1. Where MinID is NA (no matches), set FinalID to e.
2. If MinID is a number, look up that row (the earliest match) and check whether its own MinID is a number; if it is not, that row has no earlier match, so FinalID is set to MinID.
3. Rows that meet neither condition are the special cases where the earliest match of row i itself has an earlier match. The loop looks up that earlier match and sets FinalID to it.

    for (i in 1:nrow(dat)) {
      x <- dat[i]$MinID
      if (is.na(dat[i]$MinID)) {
        dat[i, FinalID := e]
      } else if (is.na(dat[x]$MinID)) {
        dat[i, FinalID := MinID]
      } else {
        dat[i, FinalID := dat[x]$MinID]
      }
    }

I think this should do it; let me know how it goes. I make no claims about its efficiency or speed.
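
The loop above follows a match back by at most two steps. If longer chains can occur (row 6 matching row 5, which matches row 3, which matches row 1), one option is to keep following the pointers until they stop changing. A minimal sketch, assuming a plain vector min_id (hypothetical, standing in for dat$MinID) in which every non-NA entry is the index of an earlier row:

```r
# Hypothetical example: min_id[i] is the row index of the earliest match for
# row i, or NA when row i has no earlier match. Row 6 chains 6 -> 5 -> 3 -> 1.
min_id <- c(NA, NA, 1, 2, 3, 5)

# Start from each row's own index (no match) or its direct match
final_id <- ifelse(is.na(min_id), seq_along(min_id), min_id)

repeat {
  # Look up where each current target itself points
  parent <- min_id[final_id]
  # Follow the pointer one more step wherever the target has an earlier match
  next_id <- ifelse(is.na(parent), final_id, parent)
  if (all(next_id == final_id)) break
  final_id <- next_id
}
```

Because every pointer goes to an earlier row, the loop cannot cycle. For this example final_id ends up as 1, 2, 1, 2, 1, 1.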
