Does the speed of a data.table subset depend on the specific key values in a weird way?

Can someone explain the following result? Unless I'm missing something (which I probably am), it seems that the speed of subsetting a data.table depends on the specific values stored in one of its columns, even when those values are of the same class and have no obvious differences other than the values themselves.

How is this possible?

    > dim(otherTest)
    [1] 3572069       2
    > dim(test)
    [1] 3572069       2
    > length(unique(test$keys))
    [1] 28741
    > length(unique(otherTest$keys))
    [1] 28742
    > sapply(test,class)
         thingy        keys 
    "character" "character" 
    > sapply(otherTest,class)
         thingy        keys 
    "character" "character" 
    > class(test)
    [1] "data.table" "data.frame"
    > class(otherTest)
    [1] "data.table" "data.frame"
    > start = Sys.time()
    > newTest = otherTest[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.5438871 secs
    > start = Sys.time()
    > newTest = test[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 42.78009 secs
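The same timings can also be taken a bit more compactly with system.time(); a minimal sketch, assuming test, otherTest and partition as defined above:

    # Equivalent timings via system.time(), assuming the objects above exist.
    system.time(newTest <- otherTest[keys %in% partition])   # fast
    system.time(newTest <- test[keys %in% partition])        # slow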

EDIT summary: so the speed difference is not related to the data.tables being of different sizes, and has nothing to do with differing numbers of unique values. As you can see in my revised example above, even after generating the keys so that they have the same number of unique values (and are in the same general range, and share at least some values, but are generally different), I get the same performance difference.

As far as sharing the data goes, I unfortunately can't share the test table, but I can share otherTest. The whole idea is that I tried to replicate the test table as closely as possible (same size, same classes/types, same keys, same number of NA values, etc.) so that I could post it to SO, but then, strangely, my made-up data.table behaved very differently and I can't figure out why!
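For what it's worth, here is a minimal sketch of the kind of synthetic copy being described; the actual generation code isn't shown in the question, so the construction below is only an assumption about how such a table might be built:

    # Hypothetical construction of a table matching test's shape:
    # same dimensions, same character classes, similar number of unique keys.
    library(data.table)
    set.seed(42)
    n <- 3572069
    otherTest <- data.table(
        thingy = as.character(sample(1e6, n, replace = TRUE)),
        keys   = as.character(sample(28742, n, replace = TRUE))
    )
    sapply(otherTest, class)   # both columns "character", as in the question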

I will also add that the only reason I suspected the problem was coming from data.table is that a couple of weeks ago I ran into a similar problem with data.table subsetting that turned out to be an actual bug in a new release of data.table (I ended up deleting the question because it was a duplicate). That bug also involved using the %in% function to subset a data.table: if there were duplicates in the right argument of %in%, it returned duplicated results, so if x = c(1,2,3) and y = c(1,1,2,2), x %in% y would return a vector of length 8. I have since updated the data.table package, so I don't think it can be the same bug, but perhaps it is related?
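For reference, correctly behaving %in% always returns a logical vector the same length as its left-hand argument, so a longer result like the one described would indeed be a bug:

    # Base R's %in% always returns a logical vector as long as its left
    # argument, regardless of duplicates in the right argument:
    x <- c(1, 2, 3)
    y <- c(1, 1, 2, 2)
    x %in% y
    # [1]  TRUE  TRUE FALSE    (length 3)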

EDIT (in response to Dean MacGregor's comment):

    > sapply(test,class)
         thingy        keys 
    "character" "character" 
    > sapply(otherTest,class)
         thingy        keys 
    "character" "character" 

    # benchmarking the original test tables
    > test2 = data.table(sapply(test, as.numeric))
    > otherTest2 = data.table(sapply(otherTest, as.numeric))
    > start = Sys.time()
    > newTest = test[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 52.68567 secs
    > start = Sys.time()
    > newTest = otherTest[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.3503151 secs

    # benchmarking after converting to numeric
    > partition = as.numeric(partition)
    > start = Sys.time()
    > newTest = otherTest2[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.7240109 secs
    > start = Sys.time()
    > newTest = test2[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 42.18522 secs

    # benchmarking again after converting back to character
    > partition = as.character(partition)
    > otherTest2 = data.table(sapply(otherTest2, as.character))
    > test2 = data.table(sapply(test2, as.character))
    > start = Sys.time()
    > newTest = test2[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 48.39109 secs
    > start = Sys.time()
    > newTest = data.table(otherTest2[keys%in%partition])
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.1846113 secs

So the slowdown does not depend on the column class.

EDIT: The problem clearly comes from data.table, because I can convert to a matrix and the problem disappears, then convert back to a data.table and the problem returns.
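A minimal sketch of the round trip being described, assuming test and partition exist as above (the exact conversion calls are an assumption):

    # Sketch of the matrix round trip described above (assumed conversion calls).
    testMat  <- as.matrix(test)            # plain character matrix
    system.time(testMat[testMat[, "keys"] %in% partition, , drop = FALSE])  # fast

    testBack <- as.data.table(testMat)     # back to a data.table
    system.time(newTest <- testBack[keys %in% partition])                   # slow again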

EDIT: I noticed that the problem has to do with how data.table handles duplicates, which sounds plausible given the bug I found last week in data.table 1.9.4, described above.

    > start = Sys.time()
    > newTest = test[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 39.19983 secs
    > start = Sys.time()
    > newTest = otherTest[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.3776946 secs

    > sum(duplicated(test))/length(duplicated(test))
    [1] 0.991954
    > sum(duplicated(otherTest))/length(duplicated(otherTest))
    [1] 0.9918879

    > otherTest[duplicated(otherTest)] = NA
    > test[duplicated(test)] = NA

    > start = Sys.time()
    > newTest = otherTest[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.2272599 secs
    > start = Sys.time()
    > newTest = test[keys%in%partition]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.2041721 secs

So, although the two data.tables have the same proportion of duplicates, they (or, more specifically, the %in% function applied within data.table) clearly handle duplicates differently. Another interesting observation related to duplicates is the following (note that I'm starting again from the original tables):

    > start = Sys.time()
    > newTest = test[keys%in%unique(partition)]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.6649222 secs
    > start = Sys.time()
    > newTest = otherTest[keys%in%unique(partition)]
    > end = Sys.time()
    > print(end - start)
    Time difference of 0.205637 secs

So removing duplicates from the right-hand argument of %in% also fixes the problem.

So, given this new information, the question remains: why do these two data.tables handle duplicate values differently?

3 answers

You're focusing on data.table when it's match (%in% is defined by a match operation) and the size of the vectors that you should be focusing on. A reproducible example:

    library(microbenchmark)

    set.seed(1492)

    # sprintf to keep the same type and nchar of your values
    keys_big <- sprintf("%014d", sample(5000, 4000000, replace=TRUE))
    keys_small <- sprintf("%014d", sample(5000, 30000, replace=TRUE))

    partition <- sample(keys_big, 250)

    microbenchmark(
      "big"=keys_big %in% partition,
      "small"=keys_small %in% partition
    )
    ## Unit: milliseconds
    ##   expr        min         lq       mean     median         uq        max neval cld
    ##    big 167.544213 184.222290 205.588121 195.137671 205.043641 376.422571   100   b
    ##  small   1.129849   1.269537   1.450186   1.360829   1.506126   2.848666   100  a

From the docs:

match returns a vector of the positions of (first) matches of its first argument in its second.

That inherently means it's going to depend on the size of the vectors and on how early matches are found "near the top" (or not found at all).
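Concretely, %in% is just a thin wrapper around match(): its base R definition, shown in ?match, is match(x, table, nomatch = 0) > 0, so anything that makes match() slow makes %in% slow too. A tiny illustration:

    # match() returns positions of first matches (NA where there is none);
    # %in% is simply match(x, table, nomatch = 0) > 0.
    match(c("b", "z"), c("a", "b", "c"))
    # [1]  2 NA
    c("b", "z") %in% c("a", "b", "c")
    # [1]  TRUE FALSE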

However, you can use %chin% from data.table to speed things up, since you're working with character vectors:

    library(data.table)

    microbenchmark(
      "big"=keys_big %chin% partition,
      "small"=keys_small %chin% partition
    )
    ## Unit: microseconds
    ##   expr       min         lq       mean     median        uq        max neval cld
    ##    big 36312.570 40744.2355 47884.3085 44814.3610 48790.988 119651.803   100   b
    ##  small   241.045   264.8095   336.1641   283.9305   324.031   1207.864   100  a

You can also use the fastmatch package (but you already have data.table loaded and are working with character vectors, so it's six of one, half a dozen of the other):

    library(fastmatch)

    # gives us similar syntax & functionality as %in% and %chin%
    "%fmin%" <- function(x, table) fmatch(x, table, nomatch = 0) > 0

    microbenchmark(
      "big"=keys_big %fmin% partition,
      "small"=keys_small %fmin% partition
    )
    ## Unit: microseconds
    ##   expr       min         lq       mean     median       uq        max neval cld
    ##    big 75189.818 79447.5130 82508.8968 81460.6745 84012.374 124988.567   100   b
    ##  small   443.014   471.7925   525.2719   498.0755   559.947    850.353   100  a

Regardless, the sizes of the vectors are ultimately going to determine how fast or slow the operation is, but the latter two options at least give you faster results. Here's a comparison of all three together for small and big vectors:

    library(ggplot2)
    library(gridExtra)

    microbenchmark(
      "small_in"=keys_small %in% partition,
      "small_ch"=keys_small %chin% partition,
      "small_fm"=keys_small %fmin% partition,
      unit="us"
    ) -> small

    microbenchmark(
      "big_in"=keys_big %in% partition,
      "big_ch"=keys_big %chin% partition,
      "big_fm"=keys_big %fmin% partition,
      unit="us"
    ) -> big

    grid.arrange(autoplot(small), autoplot(big))

[plot: side-by-side microbenchmark autoplots of %in%, %chin% and %fmin% for the small and big vectors]

UPDATE

Based on the OP's comments, here is another benchmark to consider, with and without data.table subsetting:

    dat_big <- data.table(keys=keys_big)

    microbenchmark(
      "dt" = dat_big[keys %in% partition],
      "not_dt" = dat_big$keys %in% partition,
      "dt_ch" = dat_big[keys %chin% partition],
      "not_dt_ch" = dat_big$keys %chin% partition,
      "dt_fm" = dat_big[keys %fmin% partition],
      "not_dt_fm" = dat_big$keys %fmin% partition
    )
    ## Unit: milliseconds
    ##       expr       min        lq      mean    median        uq      max neval cld
    ##         dt  11.74225  13.79678  15.90132  14.60797  15.66586 129.2547   100  a
    ##     not_dt 160.61295 174.55960 197.98885 184.51628 194.66653 305.9615   100   f
    ##      dt_ch  46.98662  53.96668  66.40719  58.13418  63.28052 201.3181   100   c
    ##  not_dt_ch  37.83380  42.22255  50.53423  45.42392  49.01761 147.5198   100  b
    ##      dt_fm  78.63839  92.55691 127.33819 102.07481 174.38285 374.0968   100   e
    ##  not_dt_fm  67.96827  77.14590  99.94541  88.75399  95.47591 205.1925   100   d

If your data is loaded less often than it is queried, you might consider setting a key on your data after each load, so that subsequent queries can use the clustered key and indexes.
Setting the key is relatively cheap thanks to the precise and modern implementation of the sorting algorithm.

    library(data.table)
    library(microbenchmark)

    set.seed(1492)
    keys_big <- sprintf("%014d", sample(5000, 4000000, replace=TRUE))
    keys_small <- sprintf("%014d", sample(5000, 30000, replace=TRUE))
    partition <- sample(keys_big, 250)

    dat_big <- data.table(keys=keys_big, key = "keys")

    microbenchmark(
      "dt" = dat_big[keys %in% partition],
      "not_dt" = dat_big$keys %in% partition,
      "dt_ch" = dat_big[keys %chin% partition],
      "not_dt_ch" = dat_big$keys %chin% partition,
      "dt_key" = dat_big[partition]
    )
    # Unit: milliseconds
    #       expr        min         lq       mean     median         uq       max neval
    #         dt   5.810935   6.100602   6.830618   6.493006   6.825171  20.47223   100
    #     not_dt 237.730092 246.318824 266.484226 257.507188 272.433109 461.17918   100
    #      dt_ch  62.822514  66.169728  71.522330  69.865380  75.056333 103.45799   100
    #  not_dt_ch  51.292627  52.551307  58.236860  54.920637  59.762000 215.65466   100
    #     dt_key   5.941748   6.210253   7.251318   6.568678   7.004453  23.45361   100

Timing of setting the key:

    dat_big <- data.table(keys=keys_big)

    system.time(setkey(dat_big, keys))
    #    user  system elapsed 
    #   0.230   0.008   0.238 

This is on recent 1.9.5.
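One caveat: a keyed lookup such as dat_big[partition] does not have exactly the same semantics as keys %in% partition, because a join returns one result group per lookup value (repeating rows for duplicated lookup values and, by default, returning NA rows for non-matches). A sketch of a closer equivalent, assuming the dat_big and partition objects above:

    # Keyed-join analogue of dat_big[keys %in% partition]:
    # unique() avoids repeated result rows, nomatch = 0L drops non-matches.
    dat_big[.(unique(partition)), nomatch = 0L]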


I would expect the operation time to be proportional to the sizes of thing and otherThing, and I don't see their sizes in your output, so it's hard to know exactly what to expect.

However, you have far more (124.28 times as many) unique values in otherthing$keys than in thing$keys, so wouldn't you expect that operation to take longer? It has to check against the values in the table for each unique value it encounters (and you seem to be aware of that, since you printed the counts).

Note that the timing ratio is around 60.8.

