UTF-8 is lost when converting from a list to a data.frame in R

I am using R 3.2.0 with RStudio 0.98.1103 on a 64-bit version of Windows 7. The Windows "regional and language settings" on my computer are set to English (US).

For some reason, the following code replaces the Czech characters "č" and "ř" with "c" and "r" in the text "Koryčany nad přehradou". This happens when I read a UTF-8 encoded XML file from the web, parse it into a list, and convert the list to a data.frame.

    library(XML)
    url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
    doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
    infoList <- xmlToList(doc[[2]][[1]])
    siteName <- infoList$siteName

    #this still displays correctly "Koryčany nad přehradou"
    print(siteName)

    #make a data.frame from the list item. I suspect here is the problem.
    df <- data.frame(name=siteName, id=1)

    #now the Czech characters are lost. I see only "Korycany nad prehradou"
    View(df)

    write.csv(df,"test.csv")
    #the test.csv file also contains "Korycany nad prehradou"
    #instead of "Koryčany nad přehradou"

What is the problem? How can I make R display the data.frame correctly with all UTF-8 characters, and save the CSV file without losing the Czech characters "č" and "ř"?

1 answer

This is not an ideal answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and built an example so that my R script produces the same results on both Windows and Linux:

(1) Get XML data in UTF-8 from the Internet

    library(XML)
    url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
    doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
    infoList <- xmlToList(doc[[2]][[1]])
    siteName <- infoList$siteName

(2) Print the text retrieved from the Internet. The encoding is UTF-8, and the R console on Windows displays it correctly under both the Czech and the English locale:

    > Sys.getlocale(category="LC_CTYPE")
    [1] "English_United States.1252"
    > print(siteName)
    [1] "Koryčany nad přehradou"
    > Encoding(siteName)
    [1] "UTF-8"
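To check what is actually stored in a string, independent of how the console renders it, it can help to look at the encoding mark and the raw bytes. A minimal sketch, assuming the site name is typed in with \u escapes instead of being downloaded (so it runs without the web service):

```r
# Same text as in the question, written with \u escapes so the script is ASCII-only
siteName <- "Kory\u010dany nad p\u0159ehradou"

Encoding(siteName)                 # declared encoding mark of the string
nchar(siteName, type = "chars")    # number of characters
nchar(siteName, type = "bytes")    # number of bytes (larger, because of multi-byte characters)

# Raw bytes: in UTF-8, "č" is c4 8d and "ř" is c5 99
bytes <- charToRaw(enc2utf8(siteName))
print(bytes)
```

If the bytes c4 8d and c5 99 are present, the string itself is intact and any "c"/"r" substitution happens later, at display or output time.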

(3) Try creating and viewing the data.frame. This is where the problem appears: the data.frame is displayed incorrectly in both the RStudio viewer and the console:

    > df <- data.frame(name=siteName, id=1)
    > df
                        name id
    1 Korycany nad prehradou  1

(4) Try using a matrix instead. Surprisingly, the matrix is displayed correctly in the R console.

    > m <- as.matrix(df)
    > View(m) #this shows incorrectly in RStudio
    > m #however, this shows correctly in the R console.
         name                     id
    [1,] "Koryčany nad přehradou" "1"
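Since the matrix built from the same data.frame prints correctly, the characters are probably still intact inside the data.frame, and only the printing or viewing step degrades them. One way to check this, sketched here with a literal string instead of the downloaded one and with stringsAsFactors = FALSE (both are my assumptions, not part of the steps above):

```r
siteName <- "Kory\u010dany nad p\u0159ehradou"

# stringsAsFactors = FALSE keeps the column as character (R 3.2.0 defaults to factors)
df <- data.frame(name = siteName, id = 1, stringsAsFactors = FALSE)

# Inspect the stored value rather than the printed table
Encoding(df$name)
stored_intact <- identical(charToRaw(df$name), charToRaw(siteName))
stored_intact
```

If stored_intact is TRUE, the data.frame holds the original UTF-8 bytes, which would point to the display path (conversion to the native code page) rather than to data.frame() itself.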

(5) Change the locale. On Windows, set the locale to Czech; on Unix or Mac, set it to UTF-8. NOTE: This is not fully reliable when I run the script in RStudio; it appears that RStudio does not always respond immediately to the Sys.setlocale command.

    #remember the original locale.
    original.locale <- Sys.getlocale(category="LC_CTYPE")

    #for Windows set locale to Czech. Otherwise set locale to UTF-8
    new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
    Sys.setlocale("LC_CTYPE", new.locale)
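Because the locale has to be restored afterwards, the switch can be wrapped in a small helper so the restore also happens when an error occurs in between. This is only a sketch: with_ctype is a hypothetical name of my own, and the locale strings are the ones used above and may not exist on every system.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_CTYPE, then restore the old one
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the requested locale is unavailable
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(res, "")) {
    message("Locale '", locale, "' is not available; keeping '", old, "'")
  }
  force(expr)  # the argument is evaluated lazily, so it runs after the switch
}

new.locale <- if (.Platform$OS.type == "windows") "Czech_Czech Republic.1250" else "en_US.UTF-8"

before <- Sys.getlocale("LC_CTYPE")
inside <- with_ctype(new.locale, Sys.getlocale("LC_CTYPE"))
after  <- Sys.getlocale("LC_CTYPE")
```

The writing step would go where Sys.getlocale("LC_CTYPE") is passed as expr here.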

(7) Write the data to a text file. IMPORTANT: do not use write.csv; use write.table instead. With the Czech locale active on my English Windows, I have to pass fileEncoding="UTF-8" to write.table. The resulting text file is then displayed correctly in Notepad++ as well as in Excel.

    write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
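A different technique that avoids the locale switch altogether is to write the UTF-8 bytes directly with writeLines(..., useBytes = TRUE). This is not what the steps here do; it is a sketch of an alternative, and it builds the tab-separated lines by hand without any quoting, so it only suits simple values. The literal string and the temporary path are assumptions for illustration.

```r
siteName <- "Kory\u010dany nad p\u0159ehradou"
df <- data.frame(name = siteName, id = 1, stringsAsFactors = FALSE)

# Header line plus one tab-separated line per row, all as UTF-8
lines <- c(paste(names(df), collapse = "\t"),
           paste(enc2utf8(df$name), df$id, sep = "\t"))

path <- tempfile(fileext = ".txt")   # hypothetical output path
con <- file(path, open = "wb")       # binary mode, so the connection does not re-encode
writeLines(lines, con, useBytes = TRUE)
close(con)

# Read the raw bytes back to see exactly what ended up in the file
written <- readBin(path, what = "raw", n = file.info(path)$size)
```

Comparing written against the expected UTF-8 bytes shows whether "č" and "ř" survived, without relying on how any editor or console displays them.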

(8) Set the locale back to the original

    Sys.setlocale("LC_CTYPE", original.locale)

(9) Try reading the text file back into R. NOTE: When reading the file, I had to set the encoding parameter (NOT fileEncoding!). The data.frame read from the file is still displayed incorrectly, but when I convert it to a matrix, the Czech UTF-8 characters are preserved:

    > data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
    #the data.frame still has the display problem, "č" and "ř" get "lost"
    > data.from.file
                        name id
    1 Korycany nad prehradou  1
    #see if a matrix displays correctly: YES it does!
    > matrix.from.file <- as.matrix(data.from.file)
    > matrix.from.file
      name                     id
    1 "Koryčany nad přehradou" "1"
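The difference between the two parameters is documented in ?read.table: fileEncoding re-encodes the file contents into the native encoding while reading, whereas encoding only marks the strings as UTF-8 without converting them. The sketch below illustrates the marking behavior; it creates its own small UTF-8 file in a temporary location (an assumption, so that it runs without the earlier steps) and uses stringsAsFactors = FALSE so the column can be inspected as character.

```r
siteName <- "Kory\u010dany nad p\u0159ehradou"

# Create a small UTF-8 file byte-for-byte so the example does not depend on the locale
path <- tempfile(fileext = ".txt")   # hypothetical path
con <- file(path, open = "wb")
writeLines(c("name\tid", paste(siteName, 1, sep = "\t")), con, useBytes = TRUE)
close(con)

# `encoding` only marks the strings as UTF-8; it does not convert them
d <- read.table(path, sep = "\t", header = TRUE, encoding = "UTF-8",
                stringsAsFactors = FALSE)

Encoding(d$name)
same_bytes <- identical(charToRaw(d$name), charToRaw(siteName))
same_bytes
```

When same_bytes is TRUE, the values read back are byte-identical to the original text, even if printing the data.frame still shows "c" and "r".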

So, the lesson learned is that before writing data with Czech characters to a file, I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or UTF-8 (on Mac and Linux). When I write the file, fileEncoding must be set to UTF-8. When I later read the file, I can continue to work in the English locale, but I have to set encoding="UTF-8" in read.table.

If anyone has a better solution, I will welcome your suggestions.
