dplyr mutate that refers to another data frame

I would like to mutate a data frame by applying a function that refers to another data frame. I can achieve this in several ways, but would like to know how to do it "correctly."

Here is an example of what I'm trying to do. I have one data frame with some start times and a second with timed observations. I would like to return a data frame giving each start time and the number of observations that occur in some window after that start, e.g.:

    set.seed(1337)
    df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
    df2 <- data.frame(time=runif(100)*100)
    lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))

The best I have managed so far with dplyr is the following (but it loses the identifier variables):

    df1 %>%
      rowwise() %>%
      do(count = filter(df2, time>.$start_time, time < (.$start_time + 15))) %>%
      mutate(n=nrow(count))

Output:

    Source: local data frame [3 x 2]
    Groups: <by row>

    # A tibble: 3 × 2
                      count     n
                     <list> <int>
    1 <data.frame [17 × 1]>    17
    2 <data.frame [18 × 1]>    18
    3 <data.frame [10 × 1]>    10

I expected to be able to do this:

    df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
    df2 <- data.frame(time=runif(100)*100)
    df1 %>%
      group_by(id) %>%
      mutate(count = nrow(filter(df2, time>start_time, time<(start_time+15))))

but this returns an error:

 Error: comparison (6) is possible only for atomic and list types 

What is the dplyr way to do this?

2 answers

Another, slightly different approach using dplyr:

    result <- df1 %>%
      group_by(id) %>%
      summarise(count = length(which(df2$time > start_time & df2$time < (start_time+15))))

    print(result)
    ### A tibble: 3 x 2
    ##      id count
    ##  <fctr> <int>
    ##1      A    17
    ##2      B    18
    ##3      C    10

You can use length and which to count the rows of df2 for which your condition is true, once for every id in df1: group by id, then compute the count inside summarise.
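As a minimal sketch of the counting idiom on its own, the snippet below uses made-up toy values (`times` and `start` are hypothetical, not the question's data) to show that `length(which(cond))` and `sum(cond)` give the same count for a logical condition without missing values.

```r
# Hypothetical toy data, not the df2 from the question
times <- c(5, 12, 18, 24, 26, 40)
start <- 10

# Logical vector: TRUE where an observation lies strictly inside (start, start + 15)
in_window <- times > start & times < (start + 15)

# which() returns the positions of the TRUE values; length() counts them
n_which <- length(which(in_window))

# sum() counts the TRUE values directly, because TRUE is coerced to 1
n_sum <- sum(in_window)

print(c(n_which = n_which, n_sum = n_sum))
```

The two differ only when the condition contains NA: which() silently drops NA entries, while sum() returns NA unless na.rm = TRUE is passed.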


If more than one start_time per id is possible, you can use the same expression, but with rowwise and mutate:

    result <- df1 %>%
      rowwise() %>%
      mutate(count = length(which(df2$time > start_time & df2$time < (start_time+15))))

    print(result)
    ##Source: local data frame [3 x 3]
    ##Groups: <by row>
    ##
    ### A tibble: 3 x 3
    ##      id start_time count
    ##  <fctr>      <dbl> <int>
    ##1      A         10    17
    ##2      B         20    18
    ##3      C         30    10

Here is one option with data.table, where we can use a non-equi join:

    library(data.table)#1.9.7+
    setDT(df1)[, start_timeNew := start_time + 15]
    setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew),
               by = .EACHI][, c('id', 'N'), with = FALSE]
    #   id  N
    #1:  A 17
    #2:  B 18
    #3:  C 10

which gives the same counts as the OP's base R method:

    sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
    #[1] 17 18 10
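The base R counts can also be attached straight back onto df1, which keeps the id column. The following is only a sketch on made-up toy data: the helper name `count_in_window` and its `width` argument are hypothetical, introduced here for illustration.

```r
# Hypothetical toy data standing in for df1 and df2 from the question
df1 <- data.frame(id = c("A", "B", "C"), start_time = c(10, 20, 30))
df2 <- data.frame(time = c(11, 12, 18, 24, 26, 33, 40, 52))

# Hypothetical helper: count observations strictly inside (start, start + width)
count_in_window <- function(start, times, width = 15) {
  sum(times > start & times < start + width)
}

# vapply() enforces that every call returns exactly one integer
df1$count <- vapply(df1$start_time, count_in_window, integer(1),
                    times = df2$time)

print(df1)
```

Compared with sapply, vapply fails loudly if a call ever returns something other than a single integer, which can be a safer default when the result is stored as a column.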

If we need the id variable as well as the result in dplyr, we can modify the OP's code:

    df1 %>%
      rowwise() %>%
      do(data.frame(., count = filter(df2, time>.$start_time, time < (.$start_time + 15)))) %>%
      group_by(id) %>%
      summarise(n = n())
    #      id     n
    #  <fctr> <int>
    #1      A    17
    #2      B    18
    #3      C    10

Or another option, using map from purrr together with dplyr:

    library(purrr)
    df1 %>%
      split(.$id) %>%
      map_df(~mutate(., N = sum(df2$time >start_time & df2$time < start_time + 15))) %>%
      select(-start_time)
    #  id  N
    #1  A 17
    #2  B 18
    #3  C 10
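A possible variant, not taken from the answers in this thread, is to skip the split step and map over the start_time column directly with purrr's map_int inside mutate. The sketch below uses made-up toy data and assumes both dplyr and purrr are installed; it keeps every column of df1, including duplicate ids if there are any.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for df1 and df2 from the question
df1 <- data.frame(id = c("A", "B", "C"), start_time = c(10, 20, 30))
df2 <- data.frame(time = c(11, 12, 18, 24, 26, 33, 40, 52))

# map_int() calls the formula once per start_time (.x) and returns an
# integer vector, which mutate() stores as the new column N
result <- df1 %>%
  mutate(N = map_int(start_time,
                     ~ sum(df2$time > .x & df2$time < .x + 15)))

print(result)
```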