How to write a Unicode string to a text file in R on Windows?

I figured out how to write Unicode strings, but am still puzzled by why it works.

    str <- "ỏ"
    Encoding(str) # UTF-8
    cat(str, file="no-iconv") # Written wrongly as <U+1ECF>
    cat(iconv(str, to="UTF-8"), file="yes-iconv") # Written correctly as ỏ

I understand why the no-iconv approach does not work. It is because cat (and writeLines) first converts the string to the native encoding, and then to the to= encoding. On Windows, this means that R first converts ỏ to Windows-1252, which cannot represent it, which leads to <U+1ECF>.

I do not understand why the yes-iconv approach works. If I understand correctly, all iconv does here is return a string encoded in UTF-8. But str is already in UTF-8! Why should iconv make any difference? Also, when iconv(str, to="UTF-8") is passed to cat, should cat not mess things up again by first converting to Windows-1252?

1 answer

I think setting the encoding of (a copy of) str to "unknown" before using cat() is less magical and works just as well. This should avoid any unwanted character set conversions in cat().
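As a minimal sketch of that idea, the snippet below wraps it in a small helper and reads the file back as raw bytes to check what was actually written. The name write_utf8_unmarked is hypothetical and not part of the original answer; enc2utf8() is added here only as a precaution so the bytes are UTF-8 before the mark is dropped.

```r
# Hypothetical helper (an assumption, not from the original answer):
# write a string as raw UTF-8 bytes by hiding its encoding mark from cat().
write_utf8_unmarked <- function(x, path) {
  y <- enc2utf8(x)           # make sure the underlying bytes are UTF-8
  Encoding(y) <- "unknown"   # drop the mark so cat() has nothing to convert
  cat(y, file = path)
  invisible(path)
}

s <- "\u1ecf"                # the character U+1ECF used in the question
out <- tempfile(fileext = ".txt")
write_utf8_unmarked(s, out)

# Read the file back as bytes to see what was written.
written <- readBin(out, what = "raw", n = file.size(out))
print(written)
```

If the approach works as described, the bytes printed should be the UTF-8 sequence for the character rather than the ASCII text <U+1ECF>.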

Here is an extended example demonstrating what I think happens in the original example:

    print_info <- function(x) {
      print(x)
      print(Encoding(x))
      str(x)
      print(charToRaw(x))
    }

    cat("(1) Original string (UTF-8)\n")
    str <- "\xe1\xbb\x8f"
    Encoding(str) <- "UTF-8"
    print_info(str)
    cat(str, file="no-iconv")

    cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
    ## from = "" is conversion from current locale, forcing "latin1" here
    str2 <- iconv(str, from="latin1", to="UTF-8")
    print_info(str2)
    cat(str2, file="yes-iconv")

    cat("\n(3) Converting (2) explicitly to latin1\n")
    str3 <- iconv(str2, from="UTF-8", to="latin1")
    print_info(str3)
    cat(str3, file="latin")

    cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
    str4 <- str
    Encoding(str4) <- "unknown"
    print_info(str4)
    cat(str4, file="unknown")

In a "Latin-1" locale (see ?l10n_info), as used by R on Windows, the output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f, which is "ỏ" in UTF-8).

In a "UTF-8" locale, the files "no-iconv" and "unknown" should be correct.
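As a side note, which of the two cases applies can be checked from within R. The following is only a sketch built on l10n_info(); the variable expected_ok is a hypothetical name, and the file lists simply restate the expectations given above.

```r
# Query the current locale; see ?l10n_info for the returned fields.
loc <- l10n_info()

# Files expected to contain the bytes e1 bb 8f in each kind of locale,
# restating the two cases described in this answer.
expected_ok <- if (isTRUE(loc[["UTF-8"]])) {
  c("no-iconv", "unknown")
} else if (isTRUE(loc[["Latin-1"]])) {
  c("yes-iconv", "latin", "unknown")
} else {
  character(0)  # other locales are not covered by this answer
}

print(loc[c("UTF-8", "Latin-1")])
print(expected_ok)
```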

The output of the extended example above is as follows, using R 3.3.2, the 64-bit Windows version, running in Wine:

    (1) Original string (UTF-8)
    [1] "ỏ"
    [1] "UTF-8"
     chr "<U+1ECF>""| __truncated__
    [1] e1 bb 8f

    (2) Conversion to UTF-8, wrong input encoding (latin1)
    [1] "á»\u008f"
    [1] "UTF-8"
     chr "á»\u008f"
    [1] c3 a1 c2 bb c2 8f

    (3) Converting (2) explicitly to latin1
    [1] "á»"
    [1] "latin1"
     chr "á»"
    [1] e1 bb 8f

    (4) Setting encoding of (1) to "unknown"
    [1] "á»"
    [1] "unknown"
     chr "á»"
    [1] e1 bb 8f

The original example calls iconv() with the default argument from = "", which means conversion from the current locale, which is effectively "latin1". Since str is actually encoded in UTF-8, the byte representation of the string is distorted in step (2), but it is then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
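The round trip can also be checked directly on the bytes, without writing any files. This is a small sketch of steps (2) and (3) only; it forces from = "latin1" explicitly, as in the extended example, so it does not depend on the current locale.

```r
# Bytes of the character in UTF-8, marked as UTF-8.
orig_bytes <- as.raw(c(0xe1, 0xbb, 0x8f))
s <- rawToChar(orig_bytes)
Encoding(s) <- "UTF-8"

# Step (2): misread the bytes as latin1 and convert them to UTF-8.
mangled <- iconv(s, from = "latin1", to = "UTF-8")

# Step (3): convert back to latin1, which is what cat() presumably does
# in a Latin-1 locale.
restored <- iconv(mangled, from = "UTF-8", to = "latin1")

print(charToRaw(mangled))                            # the distorted bytes
print(identical(charToRaw(restored), orig_bytes))    # were they restored?
```

Because each byte of the original string maps to exactly one latin1 character, the second conversion undoes the first, which is why the file ends up with the right bytes.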
