Multi-range subset

I want to get a list of values that fall within several ranges.

    library(data.table)
    values <- data.table(value = c(1:100))
    range <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))

I need results to include only the values that fall within these ranges:

  results <- c(6, 7, 8, 9, 10, 29, 30, 31, 32, 33, 34, 35, 87, 88, 89, 90, 91, 92) 

I am currently doing this with a for loop,

    results <- data.table(NULL)
    for (i in 1:NROW(range)) {
      results <- rbind(results,
                       data.table(result = values[value >= range[i, start] & value <= range[i, end], value]))
    }

however, the actual dataset is quite large, and I'm looking for a more efficient way.

Any suggestions are welcome! Thanks!

3 answers

Using a non-equi join in data.table:

 values[range, on = .(value >= start, value <= end), .(results = x.value)] 

which gives:

        results
     1:       6
     2:       7
     3:       8
     4:       9
     5:      10
     6:      29
     7:      30
     8:      31
     9:      32
    10:      33
    11:      34
    12:      35
    13:      87
    14:      88
    15:      89
    16:      90
    17:      91
    18:      92
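As a rough check of the join above, the sketch below rebuilds the question's data and compares the join result with the expected vector. The names `joined` and `expected` are only illustrative. The comment about the `x.` prefix reflects my understanding of how data.table reports join columns, so treat it as an assumption rather than a statement from the answer.

```r
library(data.table)

# Data from the question
values <- data.table(value = 1:100)
range  <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))

# Non-equi join: for every row of `range`, pick the rows of `values`
# whose value lies in [start, end]. The x. prefix asks explicitly for the
# column from `values`; as far as I understand, a bare `value` in j would
# refer to the join column, which is filled from `range` in a non-equi join.
joined <- values[range, on = .(value >= start, value <= end), .(results = x.value)]

# Expected result, written out as in the question
expected <- c(6:10, 29:35, 87:92)
print(all(joined$results == expected))
```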

Or, as suggested by @Henrik: values[value %inrange% range] . This also works very well on a data.table with multiple columns:

    # create new data
    set.seed(26042017)
    values2 <- data.table(value = c(1:100), let = sample(letters, 100, TRUE), num = sample(100))

    > values2[value %inrange% range]
        value let num
     1:     6   v  70
     2:     7   f  77
     3:     8   u  21
     4:     9   x  66
     5:    10   g  58
     6:    29   f   7
     7:    30   w  48
     8:    31   c  50
     9:    32   e   5
    10:    33   c   8
    11:    34   y  19
    12:    35   s  97
    13:    87   j  80
    14:    88   o   4
    15:    89   h  65
    16:    90   c  94
    17:    91   k  22
    18:    92   g  46

If you have the latest CRAN version of data.table, you can use non-equi joins. For example, you can create an index that can then be used to subset the original data:

    idx <- values[range, on = .(value >= start, value <= end), which = TRUE]
    # [1]  6  7  8  9 10 29 30 31 32 33 34 35 87 88 89 90 91 92
    values[idx]

Here is one method using lapply and %between%:

 rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])) 

This method loops over the rows of range and, in each iteration, subsets the values data.table using that row's start and end. lapply returns a list of data.tables, which rbindlist combines into a single data.table. If you need a vector, replace rbindlist with unlist .
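A minimal sketch of the vector variant follows. It departs slightly from the one-liner above: it selects the `value` column inside each iteration, so that `unlist` receives plain vectors rather than data.tables. The name `res_vec` is only illustrative.

```r
library(data.table)

# Data from the question
values <- data.table(value = 1:100)
range  <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))

# One subset per row of `range`; each iteration returns a plain vector
res_vec <- unlist(lapply(seq_len(nrow(range)), function(i) {
  values[value %between% range[i], value]
}))

print(res_vec)
```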


Benchmark

To check the speed of each suggestion on the given data, I ran a quick comparison:

    library(microbenchmark)
    microbenchmark(
      lmo = rbindlist(lapply(seq_len(nrow(range)), function(i) values[value %between% range[i]])),
      dd = {idx <- values[range, on = .(value >= start, value <= end), which = TRUE]; values[idx]},
      jaap = values[range, on = .(value >= start, value <= end), .(results = x.value)],
      inrange = values[value %inrange% range])

It returned

    Unit: microseconds
        expr      min        lq      mean    median      uq      max neval cld
         lmo 1238.472 1460.5645 1593.6632 1520.8630 1613.520 3101.311   100   c
          dd  688.230  766.7750  885.1826  792.8615  825.220 3609.644   100  b
        jaap  798.279  897.6355  935.9474  921.7265  970.906 1347.380   100  b
     inrange  463.002  518.3110  563.9724  545.5375  575.758 1944.948   100 a

As expected, my looping solution is quite a bit slower than the others. However, the clear winner is %inrange% , which is essentially a vectorized extension of %between% .
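To illustrate the relationship between the two operators, here is a small sketch that builds the same filter from one `between()` call per interval and compares it with `%inrange%`. The names `in_any`, `masks` and `via_between` are only illustrative, and the sketch says nothing about how `%inrange%` is implemented internally.

```r
library(data.table)

# Data from the question
values <- data.table(value = 1:100)
range  <- data.table(start = c(6, 29, 87), end = c(10, 35, 92))

# %inrange% tests every value against all intervals in a single call
in_any <- values[value %inrange% range, value]

# The same condition from one between() call per interval, combined with OR
masks <- lapply(seq_len(nrow(range)), function(i) {
  between(values$value, range$start[i], range$end[i])
})
via_between <- values$value[Reduce(`|`, masks)]

print(identical(in_any, via_between))
```

Because `%inrange%` yields a logical vector, each matching row is returned once even if the intervals overlap, whereas the join-based answers would return a row once per matching interval.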
