This is not an ideal answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and build an example so that my R script would produce the same results on both Windows and Linux:
(1) Get XML data in UTF-8 from the Internet
```r
library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD = FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName
```
(2) Print the text from the Internet. The encoding is UTF-8, and the display in the R console is correct under both the Czech and the English locale on Windows:
```
> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
```
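For anyone who wants to reproduce this without calling the web service, the same string can be created locally with Unicode escapes (the escapes are my addition; they keep the script pure ASCII and so avoid any source-file encoding issues):

```r
# Stand-in for the downloaded value, using \u escapes so the script
# itself contains only ASCII characters:
siteName <- "Kory\u010dany nad p\u0159ehradou"  # "Koryčany nad přehradou"

# R marks string literals containing \u escapes as UTF-8:
Encoding(siteName)

# enc2utf8() is a no-op here, but it converts strings that arrive in
# the native encoding:
siteName <- enc2utf8(siteName)
```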
(3) Try creating and viewing a data.frame. This is the problem: the data.frame does not display correctly, either in the RStudio viewer or in the console:
```
df <- data.frame(name=siteName, id=1)
df
                    name id
1 Korycany nad prehradou  1
```
(4) Try using a matrix instead. Surprisingly, the matrix is displayed correctly in the R console:
```
m <- as.matrix(df)
View(m)  # this shows incorrectly in RStudio
m        # however, this shows correctly in the R console
     name                     id 
[1,] "Koryčany nad přehradou" "1"
```
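If I understand the behaviour correctly, the difference comes from `print.data.frame` passing every column through `format()`, which converts to the native encoding, while printing a character matrix quotes the raw strings. A small sketch to check this (the explanation is my reading of the documentation, not something I have verified in the R sources):

```r
siteName <- "Kory\u010dany nad p\u0159ehradou"
df <- data.frame(name = siteName, id = 1, stringsAsFactors = FALSE)

# Printing the column vector directly bypasses print.data.frame,
# so the accented characters survive:
print(df$name)

# format() is where the conversion to the native encoding happens;
# on a non-UTF-8 Windows locale this is where "č" and "ř" get lost:
print(format(df$name))
```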
(5) Change the locale: on Windows, set it to Czech; on Unix or Mac, set it to UTF-8. NOTE: this has some problems when I run the script in RStudio; apparently RStudio does not always respond immediately to the Sys.setlocale command.
```r
# remember the original locale
original.locale <- Sys.getlocale(category="LC_CTYPE")
# for Windows set locale to Czech, otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type == "windows",
                     "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale)
```
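To avoid leaving the session in the wrong locale when something fails between the two Sys.setlocale calls, the switch can be wrapped in a function with `on.exit` (the name `with_czech_locale` is my own):

```r
# Run an expression under the Czech (Windows) or UTF-8 (Unix/Mac)
# locale, restoring the original locale afterwards, even on error:
with_czech_locale <- function(expr) {
  original.locale <- Sys.getlocale(category = "LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", original.locale), add = TRUE)
  new.locale <- ifelse(.Platform$OS.type == "windows",
                       "Czech_Czech Republic.1250", "en_US.UTF-8")
  Sys.setlocale("LC_CTYPE", new.locale)
  # expr is a promise, so it is only evaluated here, under the new locale:
  force(expr)
}
```

The write-then-restore steps below then collapse into a single `with_czech_locale(write.table(...))` call.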
(7) Write the data to a text file. IMPORTANT: do not use write.csv; use write.table. Because I am writing Czech text on an English Windows, I have to set fileEncoding="UTF-8" in write.table. Now the text file displays correctly in Notepad++ as well as in Excel.
```r
write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
```
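As far as I can tell from `?write.table`, fileEncoding is shorthand for opening the connection with that encoding yourself, so the equivalent explicit form would be the following (with self-contained stand-in data, so it runs without the network request):

```r
# Stand-in for the matrix built from the downloaded data:
siteName <- "Kory\u010dany nad p\u0159ehradou"
m <- as.matrix(data.frame(name = siteName, id = 1,
                          stringsAsFactors = FALSE))

# Open the connection with the encoding set explicitly instead of
# passing fileEncoding= to write.table:
con <- file("test-czech-utf8.txt", open = "w", encoding = "UTF-8")
write.table(m, con, sep = "\t")
close(con)
```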
(8) Set the locale back to the original
```r
Sys.setlocale("LC_CTYPE", original.locale)
```
(9) Try reading the text file back into R. NOTE: when reading the file, I had to set the encoding parameter (NOT fileEncoding!). The data.frame read from the file still displays incorrectly, but when I convert it to a matrix, the Czech UTF-8 characters are preserved:
```
data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")

# the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
                    name id
1 Korycany nad prehradou  1

# see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
  name                     id 
1 "Koryčany nad přehradou" "1"
```
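The round trip can also be verified without relying on the console rendering at all, by comparing the strings directly. A self-contained sketch using the same stand-in string:

```r
siteName <- "Kory\u010dany nad p\u0159ehradou"
df <- data.frame(name = siteName, id = 1, stringsAsFactors = FALSE)

# Write as a matrix with fileEncoding, read back with encoding:
write.table(as.matrix(df), "test-czech-utf8.txt", sep = "\t",
            fileEncoding = "UTF-8")
data.from.file <- read.table("test-czech-utf8.txt", sep = "\t",
                             encoding = "UTF-8", stringsAsFactors = FALSE)

# Compare the actual string, not its console display:
data.from.file$name == siteName
```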
So, the lesson learned is: before writing my data with Czech characters to a file, I need to convert the data.frame to a matrix and set my locale to Czech (on Windows) or UTF-8 (on Mac and Linux). When I write the file, I have to make sure fileEncoding is set to "UTF-8". On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I have to set encoding="UTF-8".
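The whole lesson can be collected into a pair of small helpers (the function names are my own; this is a sketch of the recipe above, not a general solution):

```r
# Write a data.frame with UTF-8 text: convert it to a matrix,
# temporarily switch the locale, and force fileEncoding = "UTF-8".
write.utf8.table <- function(df, file) {
  original.locale <- Sys.getlocale(category = "LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", original.locale), add = TRUE)
  new.locale <- ifelse(.Platform$OS.type == "windows",
                       "Czech_Czech Republic.1250", "en_US.UTF-8")
  Sys.setlocale("LC_CTYPE", new.locale)
  write.table(as.matrix(df), file, sep = "\t", fileEncoding = "UTF-8")
}

# Read it back: note encoding= (NOT fileEncoding=) marks the strings
# as UTF-8.
read.utf8.table <- function(file) {
  read.table(file, sep = "\t", encoding = "UTF-8",
             stringsAsFactors = FALSE)
}
```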
If anyone has a better solution, I will welcome your suggestions.