Select only the first row when combining data frames with multiple matches

I have two data frames, "data" and "scores", and I want to combine them on the "id" column:

    library(dplyr)  # needed for semi_join()
    data = data.frame(id = c(1,2,3,4,5),
                      state = c("KS","MN","AL","FL","CA"))
    scores = data.frame(id = c(1,1,1,2,2,3,3,3),
                        score = c(66,75,78,86,85,76,75,90))
    merge(data, scores, by = "id")
    semi_join(data, scores, by = "id")

In the "scores" data, an "id" can have several observations, and each match gets its own row after merging. See ?merge:

If there is more than one match, all possible matches contribute one row each.

However, I want to keep only the row corresponding to the first match from the scores table.

A semi-join would be nice, but it does not let me select the score from the right table.

Any suggestions?

3 answers

Using data.table along with mult = "first" and nomatch = 0L:

    require(data.table)
    setDT(scores); setDT(data)  # convert to data.tables by reference
    scores[data, mult = "first", on = "id", nomatch = 0L]
    #    id score state
    # 1:  1    66    KS
    # 2:  2    86    MN
    # 3:  3    76    AL

For each row of data, the matching rows are looked up in the id column of scores, and only the first one is kept (because mult = "first"). Rows of data with no match are dropped (because nomatch = 0L).
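To make the effect of nomatch concrete, here is a minimal sketch that runs the same join with and without it. It assumes the data.table package is installed and builds the tables with data.table() directly instead of setDT(); the names all_rows and matched are only illustrative.

```r
library(data.table)

data <- data.table(id = c(1, 2, 3, 4, 5),
                   state = c("KS", "MN", "AL", "FL", "CA"))
scores <- data.table(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                     score = c(66, 75, 78, 86, 85, 76, 75, 90))

# Default nomatch = NA: ids 4 and 5 stay in the result with score = NA
all_rows <- scores[data, on = "id", mult = "first"]

# nomatch = 0L (as in the answer): ids without a match are dropped
matched <- scores[data, on = "id", mult = "first", nomatch = 0L]
```

Leaving nomatch at its default is one way to get left-join behaviour while still keeping only the first score per id.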


Here is a base R method using aggregate and head:

 merge(data, aggregate(score ~ id, data=scores, head, 1), by="id") 

The aggregate function splits scores by id, then uses head to take the first observation for each id. Since aggregate returns a data.frame, the result can be merged directly with data.


Probably more efficient is to subset scores using duplicated, which gives the same result as aggregate with less computational overhead.

 merge(data, scores[!duplicated(scores$id),], by="id") 
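To see why the duplicated approach keeps the first match, it can help to look at the logical vector it produces. The sketch below uses the same example data. The match() variant at the end is an addition for illustration, not part of the answer above: it relies on match() returning the position of the first match, and unlike merge() it keeps every row of data.

```r
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   state = c("KS", "MN", "AL", "FL", "CA"))
scores <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                     score = c(66, 75, 78, 86, 85, 76, 75, 90))

# duplicated() flags every repeat of an id after its first occurrence
dup <- duplicated(scores$id)
first_scores <- scores[!dup, ]   # rows 1, 4 and 6 of scores

# Inner-join behaviour, as in the answer: ids 4 and 5 are dropped
inner <- merge(data, first_scores, by = "id")

# match() gives the position of the first match (NA if there is none),
# so all five rows of data are kept
left <- data
left$score <- scores$score[match(left$id, scores$id)]
```

Both variants depend on the row order of scores, so sort scores first if "first" should mean something other than order of appearance.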

Here is another method using dplyr::distinct. This is useful if you want to keep all rows from "data", even if there is no match.

    library(dplyr)  # provides %>%
    data = data.frame(id=c(1,2,3,4,5),
                      state=c("KS","MN","AL","FL","CA"))
    scores = data.frame(id=c(1,1,1,2,2,3,3,3),
                        score=c(66,75,78,86,85,76,75,90))
    data %>%
      dplyr::left_join(dplyr::distinct(scores, id, .keep_all = TRUE))
    # Joining, by = "id"
    #   id state score
    # 1  1    KS    66
    # 2  2    MN    86
    # 3  3    AL    76
    # 4  4    FL    NA
    # 5  5    CA    NA
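The join above can also be split into two steps, which makes it easier to inspect what distinct() keeps. This is only a sketch, assuming dplyr is installed; first_scores and result are illustrative names, and passing by = "id" explicitly is a small addition that avoids the "Joining" message.

```r
library(dplyr)

data <- data.frame(id = c(1, 2, 3, 4, 5),
                   state = c("KS", "MN", "AL", "FL", "CA"))
scores <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                     score = c(66, 75, 78, 86, 85, 76, 75, 90))

# Keep the first row for each id; .keep_all = TRUE retains the score column
first_scores <- distinct(scores, id, .keep_all = TRUE)

# Left join keeps all five rows of data; ids 4 and 5 get score = NA
result <- left_join(data, first_scores, by = "id")
```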

Also, if you want to replace the NA values in the new data.frame, try the tidyr::replace_na() function. Example:

    data %>%
      dplyr::left_join(dplyr::distinct(scores, id, .keep_all = TRUE)) %>%
      tidyr::replace_na(replace = list("score"=0L))
    # Joining, by = "id"
    #   id state score
    # 1  1    KS    66
    # 2  2    MN    86
    # 3  3    AL    76
    # 4  4    FL     0
    # 5  5    CA     0