Can someone explain the following result? Unless I am missing something (which I probably am), it appears that the speed of subsetting a data.table depends on the specific values stored in one of its columns, even when the columns have the same class and no obvious differences other than the values themselves.
How is this possible?
> dim(otherTest)
[1] 3572069       2
> dim(test)
[1] 3572069       2
> length(unique(test$keys))
[1] 28741
> length(unique(otherTest$keys))
[1] 28742
> sapply(test,class)
     thingy        keys
"character" "character"
> sapply(otherTest,class)
     thingy        keys
"character" "character"
> class(test)
[1] "data.table" "data.frame"
> class(otherTest)
[1] "data.table" "data.frame"
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.5438871 secs
> start = Sys.time()
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 42.78009 secs
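For anyone who wants to poke at this, here is a minimal sketch of the setup being compared above. The key values and `partition` are made up (the real `test` table cannot be shared), so the slowdown may or may not reproduce with this synthetic data; it only shows the shape of the benchmark.

```r
library(data.table)

set.seed(1)
n <- 3572069
# two tables of the same size and classes, differing only in the key values
test      <- data.table(thingy = "x",
                        keys = sample(as.character(1:28741), n, replace = TRUE))
otherTest <- data.table(thingy = "x",
                        keys = sample(as.character(1:28742), n, replace = TRUE))
partition <- unique(test$keys)[1:1000]  # hypothetical vector of key values

system.time(newTest <- test[keys %in% partition])       # slow in the original report
system.time(newTest <- otherTest[keys %in% partition])  # fast in the original report
```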
EDIT Summary: So the speed difference is not related to the data.tables being different sizes, and has nothing to do with differing numbers of unique values. As you can see in my revised example above, even after generating the keys so that both tables have the same number of unique values (values that fall in the same general range and share at least one value, but are otherwise different), I get the same performance difference.
Regarding sharing the data: unfortunately I cannot share the test table, but I can share otherTest. The whole idea is that I tried to replicate the test table as closely as possible (same size, same classes/types, same keys, same number of NA values, etc.) so that I could post it to SO, but strangely enough, my made-up data.table behaved completely differently, and I cannot figure out why!
I will also add that the only reason I suspected the problem was coming from data.table is that a couple of weeks ago I ran into a similar problem with data.table subsetting, which turned out to be an actual bug in the new release of data.table (I ended up deleting the question because it was a duplicate). That bug also involved using the %in% function to subset a data.table: if there were duplicates in the right argument of %in%, it returned duplicated results. So if x = c(1,2,3) and y = c(1,1,2,2), x %in% y would return a vector of length 8. I have since updated the data.table package, so I don't think it can be the same bug, but perhaps it is related?
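For contrast, base R's %in% always returns a logical vector of length(x), no matter how many duplicates appear on the right-hand side; the bug described above made data.table's %in% subsetting deviate from that:

```r
x <- c(1, 2, 3)
y <- c(1, 1, 2, 2)

x %in% y           # TRUE TRUE FALSE
length(x %in% y)   # always length(x) = 3 in base R, even with duplicates in y
```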
EDIT (in response to Dean MacGregor's comment)
> sapply(test,class)
     thingy        keys
"character" "character"
> sapply(otherTest,class)
     thingy        keys
"character" "character"
So the slowdown is not related to the column classes.
EDIT: The problem clearly comes from data.table, because I can convert to a matrix and the problem disappears, then convert back to data.table and the problem returns.
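A sketch of that round-trip check, using a small stand-in table (the real data cannot be shared; this assumes both columns are character, so as.matrix is lossless):

```r
library(data.table)

# stand-in data; the real test table has ~3.5M rows
test      <- data.table(thingy = c("a", "b", "c"), keys = c("k1", "k2", "k1"))
partition <- c("k1")

m <- as.matrix(test)                      # plain character matrix: the subset is fast
fastSub <- m[m[, "keys"] %in% partition, , drop = FALSE]

test2 <- as.data.table(m)                 # back to data.table: the slowdown reappears
newTest <- test2[keys %in% partition]
```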
EDIT: I noticed that the problem has to do with how data.table handles duplicates, which sounds plausible because it resembles the bug I found last week in data.table 1.9.4, described above.
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 39.19983 secs
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.3776946 secs
> sum(duplicated(test))/length(duplicated(test))
[1] 0.991954
> sum(duplicated(otherTest))/length(duplicated(otherTest))
[1] 0.9918879
> otherTest[duplicated(otherTest)] = NA
> test[duplicated(test)] = NA
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.2272599 secs
> start = Sys.time()
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.2041721 secs
So although they have nearly the same proportion of duplicates, the two data.tables (or, more precisely, the %in% function applied inside data.table) clearly handle duplicates differently. Another interesting observation related to duplicates (note that I am starting again from the original tables):
> start = Sys.time()
> newTest = test[keys%in%unique(partition)]
> end = Sys.time()
> print(end - start)
Time difference of 0.6649222 secs
> start = Sys.time()
> newTest = otherTest[keys%in%unique(partition)]
> end = Sys.time()
> print(end - start)
Time difference of 0.205637 secs
So removing duplicates from the right argument of %in% also fixes the problem.
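As a workaround (not an explanation), a keyed join sidesteps %in% entirely; this is a sketch with stand-in data, assuming the same column names as above:

```r
library(data.table)

# stand-in data; partition deliberately contains duplicates
test      <- data.table(thingy = c("a", "b", "c"), keys = c("k1", "k2", "k1"))
partition <- c("k1", "k1")

setkey(test, keys)
# join on the unique key values; nomatch = 0 drops keys absent from test
newTest <- test[.(unique(partition)), nomatch = 0]
```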
So, given this new information, the question remains: why do these two data.tables handle duplicated values differently?