Why does R data.table not support non-ASCII keys on Windows?

I filed an issue about this on GitHub but didn't get an answer. data.table is a great R package that helps us in our daily work.

However, after version 1.9.6 it suddenly stopped supporting non-ASCII keys on Windows unless the column is UTF-8 encoded (by default, the encoding of non-ASCII characters in R is platform dependent).

This is very likely a bug (and a big one, I would say). I am surprised that nobody has noticed or complained, since it has existed for almost 2 years.

I spent hours trying to solve the problem but could not.

The related commits try to convert strings in other encodings to UTF-8 and then sort and compare everything in UTF-8. The encoding handling looks correct, but I suspect the bug is hidden there. The implementation of data.table is really complicated, so I am asking whether anyone can help, so that we can submit a PR to fix this.
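To illustrate why the encoding matters for a byte-level comparison, here is a minimal sketch in base R. It assumes GBK as a stand-in for the native encoding of a Chinese Windows system (code page 936) and uses `iconv()` to re-encode explicitly; it does not reproduce data.table's internal code.

```r
# Build the string from \u escapes so the example does not depend on the
# encoding of the source file; R marks such literals as UTF-8.
x_utf8 <- "\u7ea2\u5229\u6536\u5165"  # "红利收入"

# Re-encode to GBK (assumption: stand-in for the native Windows encoding).
x_gbk <- iconv(x_utf8, from = "UTF-8", to = "GBK")

bytes_utf8 <- charToRaw(x_utf8)
bytes_gbk  <- charToRaw(x_gbk)

# Same text, different bytes: 3 bytes per character in UTF-8, 2 in GBK.
length(bytes_utf8)                 # 12
length(bytes_gbk)                  # 8
identical(bytes_utf8, bytes_gbk)   # FALSE

# Converting back to UTF-8 recovers the original bytes, which is why
# normalising both sides to UTF-8 before comparing makes a lookup succeed.
x_back <- iconv(x_gbk, from = "GBK", to = "UTF-8")
identical(charToRaw(x_back), bytes_utf8)  # TRUE
```

If one side of a comparison is converted to UTF-8 and the other is not, the two byte sequences cannot match, even though the text is the same.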

Many thanks.

Minimal reproducible example

Dataset

    library(data.table)
    ## data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 20:06:10 UTC
    ## The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
    ## Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
    ## Release notes, videos and slides: http://r-datatable.com

    dt <- data.table(
      x = c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
      y = 1:5,
      key = "x"
    )

Wrong result (returns NA) when the encoding is native

    dt[]
    ##                   x y
    ## 1: 公允价值变动损益 1
    ## 2:         红利收入 2
    ## 3:         价差收入 3
    ## 4:     其他业务支出 4
    ## 5:     资产减值损失 5

    Encoding(dt$x)
    ## [1] "unknown" "unknown" "unknown" "unknown" "unknown"

    dt[J("公允价值变动损益")][]
    ##                   x  y
    ## 1: 公允价值变动损益 NA

Succeeds only if the encoding is converted to UTF-8

Now it returns the correct answer, 1. Note that the row order of dt also changes, which should not happen.

    dt[, x := enc2utf8(x)]
    setkey(dt, x)
    dt[]
    ##                   x y
    ## 1:         价差收入 3
    ## 2: 公允价值变动损益 1
    ## 3:     其他业务支出 4
    ## 4:         红利收入 2
    ## 5:     资产减值损失 5

    Encoding(dt$x)
    ## [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"

    dt[J("公允价值变动损益")][]
    ##                   x y
    ## 1: 公允价值变动损益 1
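The two row orders shown in the outputs are consistent with sorting on the raw bytes of each encoding. The sketch below checks this in base R, assuming GBK stands in for the native encoding; `byte_key()` and `byte_order()` are hypothetical helpers written for this illustration, not data.table functions.

```r
# The five keys from the example, written as \u escapes (UTF-8 in R).
keys_utf8 <- c(
  "\u516c\u5141\u4ef7\u503c\u53d8\u52a8\u635f\u76ca",  # 公允价值变动损益
  "\u7ea2\u5229\u6536\u5165",                          # 红利收入
  "\u4ef7\u5dee\u6536\u5165",                          # 价差收入
  "\u5176\u4ed6\u4e1a\u52a1\u652f\u51fa",              # 其他业务支出
  "\u8d44\u4ea7\u51cf\u503c\u635f\u5931"               # 资产减值损失
)

# Assumption: GBK stands in for the native encoding on Chinese Windows.
keys_gbk <- iconv(keys_utf8, from = "UTF-8", to = "GBK")

# Hypothetical helpers: turn each string into a hex dump of its bytes and
# order the hex dumps. Two hex digits per byte means that ordering the
# dumps in the C locale is the same as ordering the raw bytes.
byte_key <- function(s) paste(as.character(charToRaw(s)), collapse = "")
byte_order <- function(x) {
  order(vapply(x, byte_key, character(1)), method = "radix")
}

ord_gbk  <- byte_order(keys_gbk)   # 1 2 3 4 5 (order of the first output)
ord_utf8 <- byte_order(keys_utf8)  # 3 1 4 2 5 (order after enc2utf8)
```

GBK places these common characters in pinyin order, while UTF-8 byte order follows Unicode code points, so the same keys sort differently depending on which bytes are compared.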

sessionInfo

    sessionInfo()
    ## R version 3.4.1 (2017-06-30)
    ## Platform: x86_64-w64-mingw32/x64 (64-bit)
    ## Running under: Windows 7 x64 (build 7601) Service Pack 1
    ##
    ## Matrix products: default
    ##
    ## locale:
    ## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
    ## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
    ## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
    ## [4] LC_NUMERIC=C
    ## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
    ##
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base
    ##
    ## other attached packages:
    ## [1] data.table_1.10.5
    ##
    ## loaded via a namespace (and not attached):
    ##  [1] compiler_3.4.1  backports_1.1.1 magrittr_1.5    rprojroot_1.2
    ##  [5] tools_3.4.1     htmltools_0.3.6 Rcpp_0.12.13    stringi_1.1.5
    ##  [9] rmarkdown_1.8   knitr_1.17      stringr_1.2.0   digest_0.6.12
    ## [13] evaluate_0.10.1
1 answer

I am answering my own question to close it, because the issue was resolved in a PR.

For strings, data.table compares values in UTF-8 encoding. However, because two ENC2UTF8 calls were missing in csort() and csort_pre(), data.table's sorting procedure actually depended on the encoding. On Windows, where the default encoding is not UTF-8, this leads to some weird output when the keys contain non-ASCII strings.
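On versions that still have the bug, the workaround from the question can be applied to every character column before setting a key. The following is a minimal base R sketch on a plain data.frame; `to_utf8_cols()` is a hypothetical helper for illustration and is not the fix that went into the PR.

```r
# Hypothetical helper: convert every character column to UTF-8 so that
# later sorting and matching see one consistent encoding.
to_utf8_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

df <- data.frame(
  x = c("\u7ea2\u5229\u6536\u5165", "abc"),  # "红利收入" and an ASCII string
  y = 1:2,
  stringsAsFactors = FALSE
)
df <- to_utf8_cols(df)

# Non-ASCII strings are now marked UTF-8; pure ASCII strings stay "unknown",
# which is harmless because ASCII bytes are the same in both encodings.
Encoding(df$x)
```

With a data.table, the same idea is the `dt[, x := enc2utf8(x)]` followed by `setkey(dt, x)` shown in the question.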

To debug this case, you need to know how to print non-ASCII characters from C code to the R console. Passing the raw string to Rprintf() produces garbage; you must first call translateChar() on the string.
