The data.table connection (error in vecseq) is the key for both on X and i?

I am new to R and data.table , which I find useful and fast. I am trying to join two data tables:

 > TotFreq Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType 1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM 2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM 3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM 4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM 5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM 160167: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM 160168: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM 160169: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM 160170: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM 160171: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM > Count_SM_FM Legacy_Store_Number WeekDay ItemType ObjItems 1: 113802 1 SM 12305 2: 113802 1 FM 1942 3: 113802 2 SM 11014 4: 113802 2 FM 1398 5: 113802 3 SM 10154 6: 113802 3 FM 1117 7: 113802 4 SM 10414 8: 113802 4 FM 1167 9: 113802 5 SM 10258 10: 113802 5 FM 1200 11: 113802 6 SM 11116 12: 113802 6 FM 1575 13: 113802 7 SM 13098 14: 113802 7 FM 2326 > setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType) > > ResultJoin <- TotFreq[Count_SM_FM] Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Qaru and datatable-help for advice. 

But I do not have duplicate keys in i !

Using:

 > ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T] > > ResultJoin Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems 1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305 2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305 3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305 4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305 5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305 --- 320338: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326 320339: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326 320340: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326 320341: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326 320342: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326 

I actually get duplicate entries that I had in my original TotFreq table. If I add the key also on Count_SM_FM , the connection will work:

 > setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType) > setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType) > ResultJoin <- TotFreq[Count_SM_FM] > > ResultJoin Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems 1: 113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942 2: 113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942 3: 113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942 4: 113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942 5: 113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942 --- 160167: 113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098 160168: 113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098 160169: 113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098 160170: 113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098 160171: 113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098 

I tried to check the example, maybe the problem was the lack of key variables as the first columns in TotFreq or in the absence of Count_SM_FM not sorted, but I could not reproduce the error

 > daysType <- data.table( + key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1), + key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7), + key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"), + var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6) + ) > > > detailData <- data.table( + key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1), + key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7), + var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17), + var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9), + key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b") + ) > > setkey(detailData,key1,key2,key3) > JoinResult <- detailData[daysType] 

The problem is different from the set

Error connecting two .tables data

because there allow.cartesian solves the problem.

What is the problem? Why Count_SM_FM adding a key to Count_SM_FM solve it?

Thanks!

+7
join r data.table
source share
1 answer

Oct 2014 Update: Arun fixed it in version 1.1.5:

allow.cartesian now ignored when i has no duplicates, # 742 and # 508 . Thanks to @nigmastar, @ user3645882 and others for reporting.



Previous answer ...

First send the address allow.cartesian . The error message should probably be modified to indicate that you can get large sizes even if you don't have duplicates in i , but you have duplicates on the left side of data.table . Here is a simple example:

 dt1 = data.table(a = c(1,1), b = 1:2, key = 'a') dt2 = data.table(a = c(1,2), c = 3:4) dt1[dt2] # this gives an error, because join results in 3 rows, as seen below dt1[dt2, allow.cartesian = TRUE] # abc #1: 1 1 3 #2: 1 2 3 #3: 2 NA 4 

Now that the key setting is on - no, you do not need to set the key for i , it just suppose that the first few columns are keys. Looking at your first connection result, you can see that it was not connected to ItemType and that you are using an older version of data.table (I am using 1.9.3). Therefore, I assume that either you did not actually set the key correctly and did not ItemType , or there was some error in older versions that have been fixed since then.

+4
source share

All Articles