How to dput international text?

Question

I have a bunch of author names from foreign countries in a CSV, which R reads in just fine. I am trying to clean them up for upload to Mechanical Turk (which really does not like even a single internationalized character). In doing so I ran into a problem (which I will post separately), but I cannot even dput them in a sensible way:

 > dput(df[306,"primauthfirstname"])
 "Gwena\xeblle M"
 > test <- "Gwena\xeblle M"
 <simpleError in nchar(val): invalid multibyte string 1>

In other words, dput works just fine, but pasting its result back in fails. Why doesn't dput output the information needed to copy/paste the value back into R (presumably all it needs to do is add the encoding attributes to the structure statement)? How do I get it to do so?

Note that \xeb is a valid character as far as R is concerned:

 > gsub("\xeb","", turk.df[306,"primauthfirstname"] )
 [1] "Gwenalle M"

But you cannot match the individual characters of the escape; it is the whole \x## hexadecimal code or nothing:

 > gsub("\\x","", turk.df[306,"primauthfirstname"] )
 [1] "Gwena\xeblle M"

r internationalization

Ari B. Friedman Jul 6 '12 at 20:40

1 answer

Theodore Lytras · Accepted Answer · 2013-01-15T07:56:57+0000

The dput() help page says: "Writes an ASCII text representation of an R object". Therefore, if your object contains non-ASCII characters, they cannot be represented as-is and must be converted in some way.
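Before converting anything, it can help to see which elements actually contain non-ASCII bytes. This is a minimal sketch, assuming the data is Latin-1 (as the \xeb byte suggests); the vector `x` is made up for illustration and is built from raw bytes so the example does not depend on the session's locale. With no `sub` argument, iconv() returns NA for any element it cannot convert.

```r
# Illustrative data: "Gwena\xeblle M" built from raw bytes, plus a plain ASCII name
x <- c(
  rawToChar(as.raw(c(utf8ToInt("Gwena"), 0xeb, utf8ToInt("lle M")))),
  "Smith"
)
Encoding(x) <- "latin1"  # assumption: the CSV was Latin-1 encoded

# iconv() yields NA where an element has no ASCII representation
needs_conversion <- is.na(iconv(x, from = "latin1", to = "ASCII"))
needs_conversion
```

Only the elements flagged TRUE need any of the treatment described here.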

I suggest you use iconv() to convert your vector before dput()ing it. One approach:

 > test <- "Gwena\xeblle M"
 > out <- iconv(test, from="latin1", to="ASCII", sub="byte")
 > out
 [1] "Gwena<eb>lle M"
 > gsub('<eb>', 'ë', out)
 [1] "Gwenaëlle M"

which, as you can see, works in both directions: you can use gsub() to convert the byte markers back to characters (if your encoding supports them, e.g. UTF-8).
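The gsub() call above restores one specific byte. A rough sketch of the same round trip for any byte follows. `restore_bytes()` is a hypothetical helper, not a base R function; it assumes the original data was Latin-1 and that the text contains no literal `<xx>` sequences of its own, which would be wrongly converted.

```r
# Build the Latin-1 string from raw bytes so the example is locale-independent
x <- rawToChar(as.raw(c(utf8ToInt("Gwena"), 0xeb, utf8ToInt("lle M"))))
Encoding(x) <- "latin1"

# Forward: each non-ASCII byte becomes a "<xx>" marker, giving pure ASCII
ascii <- iconv(x, from = "latin1", to = "ASCII", sub = "byte")

# Backward (hypothetical helper): replace each "<xx>" marker with the
# Latin-1 character for that byte, converted to UTF-8
restore_bytes <- function(s) {
  markers <- unique(unlist(regmatches(s, gregexpr("<[0-9a-f]{2}>", s))))
  for (mk in markers) {
    ch <- rawToChar(as.raw(strtoi(substr(mk, 2, 3), base = 16L)))
    Encoding(ch) <- "latin1"
    s <- gsub(mk, enc2utf8(ch), s, fixed = TRUE)
  }
  s
}

restored <- restore_bytes(ascii)
```

The intermediate `ascii` value is what survives dput() and a copy/paste, and `restore_bytes()` is applied after pasting it back.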

The second approach is simpler (and I think preferable for your needs), but it is one-way only, and your libiconv may not support it:

 > test <- "Gwena\xeblle M"
 > iconv(test, from="latin1", to="ASCII//TRANSLIT")
 [1] "Gwenaelle M"
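Since transliteration support varies between iconv implementations, one hedged option is to try it and fall back to byte markers where it fails. In this sketch `to_ascii()` is a hypothetical wrapper and the input is assumed to be Latin-1. The exact transliterated output depends on the platform, so none is shown.

```r
# Hypothetical wrapper: transliterate to ASCII where the platform allows it,
# otherwise fall back to "<xx>" byte markers
to_ascii <- function(x) {
  out <- tryCatch(
    iconv(x, from = "latin1", to = "ASCII//TRANSLIT"),
    # some iconv builds reject //TRANSLIT outright; treat that as "all failed"
    error = function(e) rep(NA_character_, length(x))
  )
  failed <- is.na(out)
  out[failed] <- iconv(x[failed], from = "latin1", to = "ASCII", sub = "byte")
  out
}

# Illustrative data built from raw bytes so it is locale-independent
authors <- c(
  rawToChar(as.raw(c(utf8ToInt("Gwena"), 0xeb, utf8ToInt("lle M")))),
  "Smith"
)
Encoding(authors) <- "latin1"

clean <- to_ascii(authors)
```

Either way the result is ASCII only, which is what dput() and Mechanical Turk need.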

Hope this helps!
