How to write and read extended ASCII characters in R?

10001100 is Œ, the Latin capital ligature OE, in the extended ASCII table. I want to write it to a UTF-8 encoded file.

 zz <- file("c:/testbin", "wb")
 writeBin("10001100", zz)
 close(zz)

When I open the file in Office (encoding = UTF-8), I see Œ. How can I read it with readBin?

 zz <- file("c:/testbin", "rb")
 readBin(zz, raw()) -> x
 x
 [1] c5
 readBin(zz, character()) -> x
 Warning message:
 In readBin(zz, character()) :
   incomplete string at end of file has been discarded
 x
 character(0)
2 answers

There are several difficulties here.

Firstly, there are actually several extended ASCII tables. Since you are on Windows, you are probably using CP1252, which is one of them; it is also called Windows-1252 or ANSI and is the default "Latin" encoding on Windows. However, the code for Œ differs across this family of tables. In CP1252, "Œ" is represented by 10001100, or "\x8c", as you wrote. However, it does not exist in ISO-8859-1. And in UTF-8 it is encoded as the two bytes "\xc5\x92", which correspond to the code point "\u0152", as indicated by rlegendi.
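
As a quick check of the byte values mentioned above, here is a minimal sketch. It assumes your platform's iconv knows the name "CP1252", and the variable names are only illustrative:

```r
# Build the single CP1252 byte 0x8c (binary 10001100) without typing it literally
cp1252_char <- rawToChar(as.raw(0x8c))

# Convert the same character to UTF-8 and inspect the result
utf8_char  <- iconv(cp1252_char, from = "CP1252", to = "UTF-8")
utf8_bytes <- charToRaw(utf8_char)   # the UTF-8 byte sequence
code_point <- utf8ToInt(utf8_char)   # the Unicode code point as an integer

print(utf8_bytes)   # c5 92
print(code_point)   # 338, i.e. 0x152
```

The same character is therefore one byte in CP1252 and two bytes in UTF-8, which is why the bytes have to be converted before writing.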

So, to write UTF-8 from a CP1252-as-binary-as-string, you have to convert your string to "raw" (the R class for bytes), then to character, and then change its encoding from CP1252 to UTF-8 (in effect, convert its byte value to the corresponding bytes for the same character in UTF-8). After that you can convert it back to raw and finally write it to a file:

 char_bin_str <- '10001100'
 char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))), # "\x8c" 8c 140 '10001100'
                 from="CP1252", to="UTF-8")
 test.file <- "~/test-unicode-bytes.txt"
 zz <- file(test.file, 'wb')
 writeBin(charToRaw(char_u), zz)
 close(zz)

Secondly, when you call readBin(), do not forget to specify a number of items to read that is large enough (n=file.info(test.file)$size here); otherwise it reads only the first byte (see below):

 zz <- file(test.file, 'rb')
 x <- readBin(zz, 'raw', n=file.info(test.file)$size)
 close(zz)
 > x
 [1] c5 92
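
To see the effect of n in isolation, here is a small sketch that uses a scratch file from tempfile() (an assumption for illustration, not the file from the question):

```r
tmp <- tempfile()                       # hypothetical scratch file
writeBin(as.raw(c(0xc5, 0x92)), tmp)    # the two UTF-8 bytes of the OE ligature

first_only <- readBin(tmp, "raw")                            # n defaults to 1
all_bytes  <- readBin(tmp, "raw", n = file.info(tmp)$size)   # read the whole file

print(first_only)   # c5
print(all_bytes)    # c5 92
unlink(tmp)
```

With the default n, only the lead byte c5 comes back, which matches the output shown in the question.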

Thirdly, if in the end you want to turn it back into a character that R understands and displays correctly, you must first convert it to a string using rawToChar(). How it is displayed then depends on your default encoding; see Sys.getlocale() to find out what it is (probably something ending in 1252 on Windows). It is probably best to declare that your character should be read as UTF-8; otherwise it will be interpreted in your default encoding.

 xx <- rawToChar(x)
 Encoding(xx) <- "UTF-8"
 > xx
 [1] "Œ"
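
Note that setting Encoding() only declares how the bytes should be interpreted; it does not convert them. A minimal sketch, with the two bytes hard-coded here rather than read from the file:

```r
x  <- as.raw(c(0xc5, 0x92))   # bytes as they would be read back from the file
xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"       # declare the encoding; the bytes stay the same

n_bytes    <- nchar(xx, type = "bytes")     # length in bytes
n_chars    <- nchar(xx, type = "chars")     # length in characters
same_bytes <- identical(charToRaw(xx), x)   # TRUE: nothing was converted

print(c(n_bytes, n_chars))   # 2 1
```

Once the string is marked as UTF-8, R counts it as a single character even though it occupies two bytes.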

This should keep things under control, write the correct bytes in UTF-8, and behave the same on every OS. Hope this helps.


PS: I'm not quite sure why your code returned c5, and I guess it would have returned c5 92 if you had set n=2 (or more) as a parameter of readBin(). On my machines (Mac OS X 10.7 with R 3.0.2, and Win XP with R 2.15), your code returns 31, the hexadecimal ASCII code of '1' (the first character of '10001100', which makes sense). Perhaps you opened your file in Office as CP1252 and saved it as UTF-8 before returning to R?
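
The 31 can be reproduced with a short sketch. It writes the same string as in the question, but to a scratch file from tempfile() (an assumption for illustration), and reads every byte back:

```r
tmp <- tempfile()             # hypothetical scratch file
writeBin("10001100", tmp)     # writes the characters '1' and '0', not one byte

bytes <- readBin(tmp, "raw", n = file.info(tmp)$size)
print(bytes)   # 31 30 30 30 31 31 30 30 00
unlink(tmp)
```

writeBin() stores a character string as its own bytes followed by a terminating zero byte, so the file holds nine bytes: the ASCII codes of the eight digits plus 00.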


Try this instead (I replaced the binary string with the Unicode escape, because I think it is the better choice when you want this output):

 writeBin(charToRaw("\u0152"), zz) 
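
For completeness, here is a sketch of that line with the connection opened and closed around it. The output path comes from tempfile() and enc2utf8() is added as a precaution; both are assumptions, not part of the original answer:

```r
out <- tempfile()                            # hypothetical output file
zz <- file(out, "wb")
writeBin(charToRaw(enc2utf8("\u0152")), zz)  # write the UTF-8 bytes of U+0152
close(zz)

bytes <- readBin(out, "raw", n = file.info(out)$size)
print(bytes)   # c5 92
unlink(out)
```

Writing raw bytes rather than the string itself avoids the trailing zero byte that writeBin() appends to character data.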
