On a data table with a specific key. How fast is a range query where the search key is a partial key match? Example. In the large data table (see Q1 for size), get a subset of the rows with the "Last Name" key starting with "Jo":
library(data.table) set.seed(20150731) dtUser <- data.table( Lastname=c("Jansen", "Jirovski", "Jocond", "Johnson", "Johnson", "Joule", "Jozef", "Jukanen", "Kanban"), UserID=sample(1:100, 9, replace=FALSE), key="Lastname")
Question 1 . The size of the data table is potentially large (from 1 to 10 million rows, not more than 100 bytes / row). Fast performance is critical. Is Query1 query faster?
Question 2 . In the event that data.table is not suitable for finding an "ultrafast" range by matching a partial key. What is the most suitable data structure to use in R?
TEST RESULTS : Using Query1 in a 1.2 GB dataset: 40 million rows, 4 columns (Str1, Str2, Str3, BogusInteger). StrX here are unique short strings of medium length 10 characters, similar to Lastname people.
c(nbRows=nrow(myDT), "SizeMB"=round(object.size(myDT) / 1e6, 0)) # nbRows SizeMB # 39480360 1208 setkey(myDT, Str1, Str2, Str3) # Partial key match (slower) system.time(dtTest <- myDT[Str1 > "tranbar" & Str1 < "tranbaz", ]) system.time(dtTest <- myDT[Str1 > "int" & Str1 < "intz" & Str2 > "mo" & Str2 < "mu", ]) system.time(dtTest <- myDT[Str1 > "int" & Str1 < "intz" & Str2 > "mo" & Str2 < "mu" & Str3 > "t" & Str3 < "u", ]) # Exact key match (faster) system.time(dtTest <- myDT[Str1 == "transportation", ]) system.time(dtTest <- myDT[Str1 == "transportation" & Str2 == "energy", ]) system.time(dtTest <- myDT[Str1 == "transportation" & Str2 == "energy" & Str2 == "costs", ])
- Computer: desktop, 8 GB RAM, processor: Q9550, Xubuntu 14.04 x64
- R 3.20, RStudio 0.99.446
- data.table 1.9.4
- Incomplete key match: The query lasts about 11 seconds per column, regardless of the rows returned (from 200,000 to 5 rows or even 0 in my various tests).
- Exact key matching: much faster, almost instantly for 1 column.
- Query Str1 column only: average time 11 seconds (partial), 0.02 seconds (exact)
- Query Str1 and Str2: average time 23 sec (partial), 3.2 sec (exact)
- Query Str1 and Str2 and Str3: average time 35 seconds (partial), 5.3 seconds (exact)