Convert a Unicode-escaped string to ASCII

After reading all about iconv and Encoding, I'm still confused.

I'm scraping the source of a webpage and I have a line that looks like this: 'pretty\u003D\u003Ebig' (displayed in the R console as 'pretty\\u003D\\u003Ebig'). I want to convert this to an ASCII string, which should be 'pretty=>big'.

Simply put, if I set

 x <- 'pretty\\u003D\\u003Ebig' 

how do I convert x to get pretty=>big?

Any suggestions?

+8
r unicode unicode-string text-processing iconv
7 answers

Use parsing, but do not evaluate the results:

 x1 <- 'pretty\\u003D\\u003Ebig'
 x2 <- parse(text = paste0("'", x1, "'"))
 x3 <- x2[[1]]
 x3
 # [1] "pretty=>big"
 is.character(x3)
 # [1] TRUE
 length(x3)
 # [1] 1
+7

Although I accepted Hong Oyi's answer, I can't help thinking that parse and eval is a heavyweight solution. Also, as pointed out, it is unsafe, although for my application I can be sure that I won't receive dangerous quotes.

So I came up with an alternative, somewhat brute-force approach:

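 udecode <- function(string){
   # convert four hex digits to the corresponding character
   uconv <- function(chars) intToUtf8(strtoi(chars, 16L))
   # pieces marked with a leading "|" are hex codes; everything else is kept as is
   ufilter <- function(string) {
     if (substr(string, 1, 1)=="|") uconv(substr(string, 2, 5)) else string
   }
   # mark each \uXXXX as ",|XXXX," and split on the commas
   # (note: commas already present in the input are lost by this split)
   string <- gsub("\\\\u([[:xdigit:]]{4})", ",|\\1,", string, perl=TRUE)
   strings <- unlist(strsplit(string, ","))
   string <- paste(sapply(strings, ufilter), collapse='')
   return(string)
 }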

Any simplifications are welcome!
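One possible simplification, offered only as a sketch: base R's gregexpr and regmatches<- can replace each escape in place, without the split-and-rejoin step, so commas in the input survive. The name udecode2 is made up for this example, and it assumes plain four-digit \uXXXX escapes (no surrogate pairs, no \u0000).

```r
# udecode2 is a hypothetical name; assumes simple \uXXXX escapes only
udecode2 <- function(string) {
  # locate every backslash-u-plus-four-hex-digits escape
  m <- gregexpr("\\\\u[[:xdigit:]]{4}", string)
  # replace each match with the character for its hex code
  regmatches(string, m) <- lapply(regmatches(string, m), function(esc) {
    vapply(esc,
           function(e) intToUtf8(strtoi(substring(e, 3), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  string
}

res <- udecode2(c('pretty\\u003D\\u003Ebig', 'a,b', 'no escapes'))
```

Because gregexpr and regmatches are vectorised, this version also accepts a character vector rather than a single string.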

+3

With the stringi package:

 > x <- 'pretty\\u003D\\u003Ebig'
 > stringi::stri_unescape_unicode(x)
 [1] "pretty=>big"
+3

I sympathize; I have struggled with R and Unicode text in the past, and not always successfully. If your data is in x, first try a global substitution, something like this:

 x <- gsub("\u003D", "=>", x) 

I sometimes use a construction like

 lapply(x, utf8ToInt) 

to see where the high code points are, e.g. anything over 150. It helps me find problems caused by non-breaking spaces, for example, which seem to pop up every now and then.
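As a rough illustration of that inspection step (the sample strings and the cutoff of 127, i.e. anything outside ASCII, are my own choices for the example), something like this flags which strings contain high code points and where:

```r
# Made-up sample data: the second string contains a non-breaking space (U+00A0)
x <- c("plain text", "non\u00A0breaking")

code_points <- lapply(x, utf8ToInt)

# TRUE for strings containing any code point outside the ASCII range
has_non_ascii <- vapply(code_points, function(cp) any(cp > 127L), logical(1))

# character positions of the offending code points in each string
positions <- lapply(code_points, function(cp) which(cp > 127L))
```

Here positions[[2]] points at the fourth character, which is the non-breaking space.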

+1

A use for eval(parse)!

 eval(parse(text=paste0("'", x, "'"))) 

This has its own problems, of course, such as having to manually escape any quote marks in the string. But it should work for any valid Unicode sequences that may appear.
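To make the quoting problem concrete, here is a small sketch. The input bad is invented for the example; it is not from the question. When the string itself contains a single quote, the text no longer parses to one string constant but to several expressions, and eval would run every one of them.

```r
# A made-up input containing single quotes with code in between
bad <- "a'; stop('boom'); '"

parsed <- parse(text = paste0("'", bad, "'"))
n_expr <- length(parsed)             # several expressions, not one string
middle_is_call <- is.call(parsed[[2]])  # the middle one is a call to stop()

# A well-behaved input parses to exactly one character constant
good <- parse(text = paste0("'", "pretty\\u003D\\u003Ebig", "'"))
ok <- length(good) == 1L && is.character(good[[1]])
```

A check like ok (exactly one expression, and it is a character constant) is one way to refuse suspicious input before doing anything with the result; it is a guard I am suggesting, not something the answers above rely on.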

+1
 > iconv('pretty\u003D\u003Ebig', "UTF-8", "ASCII")
 [1] "pretty=>big"

but your string has an extra level of escaping
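A quick check of why that matters, as a sketch: in the single-backslash literal the R parser has already turned \u003D into = before iconv ever sees it, so iconv is not doing the unescaping. Given the double-backslash string from the question, iconv just returns it unchanged, because a backslash followed by u003D is already plain ASCII.

```r
# Single backslash: the parser decodes the escapes when reading the literal
literal <- 'pretty\u003D\u003Ebig'
literal_is_decoded <- literal == "pretty=>big"

# Double backslash (the question's case): the escapes are literal characters
x <- 'pretty\\u003D\\u003Ebig'
converted <- iconv(x, "UTF-8", "ASCII")
unchanged <- identical(converted, x)
```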

0

The trick here is that '\\u003D' is actually 6 characters, while you want '\u003D', which is just one character. A further trick is that to match those backslashes you need to use doubly escaped backslashes in the pattern:

 gsub("\\\\u003D\\\\u003E", "\u003D\u003E", x)
 #[1] "pretty=>big"

To replace multiple characters with a single character, you need to target the entire pattern. You cannot just delete the backslashes. (Since you have indicated that this is a more general problem, I think the answer may lie in modifying your as yet undescribed method for loading this text.)
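The character counts are easy to verify with nchar, and the same small sketch shows what happens if only the backslash is deleted (the variable names are just for illustration):

```r
escaped <- '\\u003D'   # backslash, u, 0, 0, 3, D
decoded <- '\u003D'    # the single character "="

n_escaped <- nchar(escaped)
n_decoded <- nchar(decoded)

# Deleting only the backslash leaves the letter u and the hex digits behind
stripped <- gsub("\\", "", escaped, fixed = TRUE)
```

So stripped is the five-character string u003D, not an equals sign, which is why the whole six-character sequence has to be matched and replaced.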

When I load your functions and dependencies, this code works:

 > freq <- ngram(c('pretty\u003D\u003Ebig'), year_start = 1950)
 >
 > str(freq)
 'data.frame': 59 obs. of 4 variables:
  $ Year     : num 1950 1951 1952 1953 1954 ...
  $ Phrase   : Factor w/ 1 level "pretty=>big": 1 1 1 1 1 1 1 1 1 1 ...
  $ Frequency: num 1.52e-10 6.03e-10 5.98e-10 8.27e-10 8.13e-10 ...
  $ Corpus   : Factor w/ 1 level "eng_2012": 1 1 1 1 1 1 1 1 1 1 ...

(So I guess it's still unclear in which case this fails.)

0