Data.table: how to query the fastest range when a partial key matches?

On a data table with a specific key. How fast is a range query where the search key is a partial key match? Example. In the large data table (see Q1 for size), get a subset of the rows with the "Last Name" key starting with "Jo":

library(data.table) set.seed(20150731) dtUser <- data.table( Lastname=c("Jansen", "Jirovski", "Jocond", "Johnson", "Johnson", "Joule", "Jozef", "Jukanen", "Kanban"), UserID=sample(1:100, 9, replace=FALSE), key="Lastname") # Lastname UserID # 1: Jansen 77 # 2: Jirovski 69 # 3: Jocond 24 # 4: Johnson 7 # 5: Johnson 27 # 6: Joule 74 # 7: Jozef 9 # 8: Jukanen 54 # 9: Kanban 78 #---Query1 (OK): dtUser[Lastname > "Jo" & Lastname < "Ju",] # Lastname UserID # 1: Jocond 24 # 2: Johnson 7 # 3: Johnson 27 # 4: Joule 74 # 5: Jozef 9 #---Query2 (Failed): dtUser[J("Jo"),] # Lastname UserID # 1: Jo NA 

Question 1 . The size of the data table is potentially large (from 1 to 10 million rows, not more than 100 bytes / row). Fast performance is critical. Is Query1 query faster?

Question 2 . In the event that data.table is not suitable for finding an "ultrafast" range by matching a partial key. What is the most suitable data structure to use in R?

TEST RESULTS : Using Query1 in a 1.2 GB dataset: 40 million rows, 4 columns (Str1, Str2, Str3, BogusInteger). StrX here are unique short strings of medium length 10 characters, similar to Lastname people.

 c(nbRows=nrow(myDT), "SizeMB"=round(object.size(myDT) / 1e6, 0)) # nbRows SizeMB # 39480360 1208 setkey(myDT, Str1, Str2, Str3) # Partial key match (slower) system.time(dtTest <- myDT[Str1 > "tranbar" & Str1 < "tranbaz", ]) system.time(dtTest <- myDT[Str1 > "int" & Str1 < "intz" & Str2 > "mo" & Str2 < "mu", ]) system.time(dtTest <- myDT[Str1 > "int" & Str1 < "intz" & Str2 > "mo" & Str2 < "mu" & Str3 > "t" & Str3 < "u", ]) # Exact key match (faster) system.time(dtTest <- myDT[Str1 == "transportation", ]) system.time(dtTest <- myDT[Str1 == "transportation" & Str2 == "energy", ]) system.time(dtTest <- myDT[Str1 == "transportation" & Str2 == "energy" & Str2 == "costs", ]) 
  • Computer: desktop, 8 GB RAM, processor: Q9550, Xubuntu 14.04 x64
  • R 3.20, RStudio 0.99.446
  • data.table 1.9.4
  • Incomplete key match: The query lasts about 11 seconds per column, regardless of the rows returned (from 200,000 to 5 rows or even 0 in my various tests).
  • Exact key matching: much faster, almost instantly for 1 column.
  • Query Str1 column only: average time 11 seconds (partial), 0.02 seconds (exact)
  • Query Str1 and Str2: average time 23 sec (partial), 3.2 sec (exact)
  • Query Str1 and Str2 and Str3: average time 35 seconds (partial), 5.3 seconds (exact)
+4
source share

All Articles