OK, I submitted an issue on GitHub but didn't get an answer. data.table is a great R package that helps us in our daily work.
However, after version 1.9.6 it suddenly stopped supporting non-ASCII keys on Windows unless the column is UTF-8 encoded (by default, the encoding of non-ASCII characters in R is platform dependent).
This is very likely a bug (and a big one, I would say). I am surprised that nobody has noticed or complained, since the problem has existed for almost 2 years.
I spent hours trying to solve the problem, but could not. The related commits actually try to convert strings in other encodings to UTF-8, and then sort and compare all strings in UTF-8. The encoding handling looks correct, yet I suspect the bug is hidden somewhere in there. The implementation of data.table is really complicated, so I am asking whether anyone can help, so that we can submit a PR to fix this.
Many thanks.
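To make clear what I mean by "convert to UTF-8, then compare", here is a minimal base R sketch of the idea. It is not data.table's actual implementation, only my understanding of the approach. It uses a latin1 character instead of Chinese text so that it behaves the same on any platform, and the variable names are my own.

```r
# The same character ("é") stored under two different encodings.
x_latin1 <- rawToChar(as.raw(0xe9))           # one byte in latin1
Encoding(x_latin1) <- "latin1"
x_utf8 <- rawToChar(as.raw(c(0xc3, 0xa9)))    # two bytes in UTF-8
Encoding(x_utf8) <- "UTF-8"

# Byte-wise, the two strings are different ...
bytes_differ <- !identical(charToRaw(x_latin1), charToRaw(x_utf8))

# ... but after normalising to UTF-8 their bytes agree,
# so a byte-wise sort or comparison becomes meaningful.
x_norm <- enc2utf8(x_latin1)
bytes_agree <- identical(charToRaw(x_norm), charToRaw(x_utf8))

print(c(bytes_differ = bytes_differ, bytes_agree = bytes_agree))
```

If the key column and the lookup value are not both normalised this way before they are compared, a byte-wise match could fail even though the strings look identical when printed.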
Minimal reproducible example
Dataset
library(data.table)
## data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 20:06:10 UTC
## The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
## Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
## Release notes, videos and slides: http://r-datatable.com

dt <- data.table(
  x = c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
  y = 1:5,
  key = "x"
)
Wrong result (returns NA) when the encoding is native
dt[]
##                   x y
## 1: 公允价值变动损益 1
## 2:         红利收入 2
## 3:         价差收入 3
## 4:     其他业务支出 4
## 5:     资产减值损失 5

Encoding(dt$x)
## [1] "unknown" "unknown" "unknown" "unknown" "unknown"

dt[J("公允价值变动损益")][]
##                   x  y
## 1: 公允价值变动损益 NA
Works only if the encoding is converted to UTF-8
Now it returns the correct answer, 1. Note that the row order of dt also changes, which should not happen.
dt[, x := enc2utf8(x)]
setkey(dt, x)
dt[]
##                   x y
## 1:         价差收入 3
## 2: 公允价值变动损益 1
## 3:     其他业务支出 4
## 4:         红利收入 2
## 5:     资产减值损失 5

Encoding(dt$x)
## [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"

dt[J("公允价值变动损益")][]
##                   x y
## 1: 公允价值变动损益 1
sessionInfo
sessionInfo()