Select only the first row when combining data frames with multiple matches

Question

I have two data frames, data and scores, and I want to merge them on the id column:

    data = data.frame(id = c(1,2,3,4,5), state = c("KS","MN","AL","FL","CA"))
    scores = data.frame(id = c(1,1,1,2,2,3,3,3), score = c(66,75,78,86,85,76,75,90))
    merge(data, scores, by = "id")
    semi_join(data, scores, by = "id")

In the scores data frame, a single id can have several observations, and each match gets its own row after the merge. From ?merge:

If there is more than one match, all possible matches contribute one row each.

However, I want to keep only the row corresponding to the first match from the scores table.

A semi-join would be nice, but then I can't select the score from the right-hand table.

Any suggestions?

+7

join r

Aguy Jun 10 '16 at 13:22

3 answers

Here is a base R method using aggregate and head:

 merge(data, aggregate(score ~ id, data=scores, head, 1), by="id")

The aggregate function splits scores by id, then uses head to take the first observation for each id. Since aggregate returns a data.frame, it can be merged directly with data.

Probably more efficient is to subset the scores data.frame using duplicated, which gives the same result as aggregate with less computational overhead.

 merge(data, scores[!duplicated(scores$id),], by="id")
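To see why the duplicated approach keeps the first match, it can help to look at the intermediate steps. The following is only a sketch on the question's data; the names first_scores, inner and left are introduced here for illustration and are not part of the original answer. The all.x = TRUE variant is an optional extra for the case where unmatched rows of data should be kept.

```r
data <- data.frame(id = c(1,2,3,4,5), state = c("KS","MN","AL","FL","CA"))
scores <- data.frame(id = c(1,1,1,2,2,3,3,3), score = c(66,75,78,86,85,76,75,90))

# duplicated() flags every repeat of an id after its first occurrence
duplicated(scores$id)
# [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

# Negating the flags keeps only the first row for each id
first_scores <- scores[!duplicated(scores$id), ]

# Inner merge: ids 4 and 5 have no score, so they are dropped
inner <- merge(data, first_scores, by = "id")

# all.x = TRUE keeps every row of data, with NA where no score exists
left <- merge(data, first_scores, by = "id", all.x = TRUE)
```

Note that "first" here means first in the current row order of scores, so sort scores beforehand if a particular row should win.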

+4

lmo Jun 10 '16 at 13:26

Here is another method using dplyr::distinct. This is useful if you want to keep all rows from data, even those with no match.

    library(dplyr)  # provides %>%, left_join and distinct

    data = data.frame(id=c(1,2,3,4,5), state=c("KS","MN","AL","FL","CA"))
    scores = data.frame(id=c(1,1,1,2,2,3,3,3), score=c(66,75,78,86,85,76,75,90))

    # distinct() keeps the first row per id; left_join() keeps every row of data
    data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = TRUE))
    # Joining, by = "id"
    #   id state score
    # 1  1    KS    66
    # 2  2    MN    86
    # 3  3    AL    76
    # 4  4    FL    NA
    # 5  5    CA    NA

Also, if you want to replace the NA values in the new data.frame, try the tidyr::replace_na() function. Example:

    data %>%
      dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>%
      tidyr::replace_na(replace = list("score"=0L))
    # Joining, by = "id"
    #   id state score
    # 1  1    KS    66
    # 2  2    MN    86
    # 3  3    AL    76
    # 4  4    FL     0
    # 5  5    CA     0

+2

Huanfa chen Apr 7 '17 at 13:16

Arun · Accepted Answer · 2016-06-10T13:29:52+0000

Using data.table along with mult = "first" and nomatch = 0L:

    require(data.table)
    setDT(scores); setDT(data) # convert to data.tables by reference
    scores[data, mult = "first", on = "id", nomatch=0L]
    #    id score state
    # 1:  1    66    KS
    # 2:  2    86    MN
    # 3:  3    76    AL

For each row of data's id column, the matching rows in scores' id column are found, and only the first one is kept (because of mult = "first"). Rows of data with no match are dropped (because of nomatch = 0L).
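The effect of nomatch is easiest to see by running the join with and without it. This is a small sketch, assuming the data.table package is installed; it builds the tables with data.table() instead of setDT, and the names inner and outer are illustrative only.

```r
library(data.table)

data <- data.table(id = c(1,2,3,4,5), state = c("KS","MN","AL","FL","CA"))
scores <- data.table(id = c(1,1,1,2,2,3,3,3), score = c(66,75,78,86,85,76,75,90))

# nomatch = 0L: ids without a score (4 and 5) are dropped from the result
inner <- scores[data, mult = "first", on = "id", nomatch = 0L]

# Default nomatch (NA): every row of data is kept, score is NA when unmatched
outer <- scores[data, mult = "first", on = "id"]
```

Leaving nomatch at its default gives the data.table counterpart of a left join that keeps only the first match.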
