Is this the best way to unescape unicode escape sequences in Ruby?

I have text that contains Unicode escape sequences such as \u003C. This is what I came up with to unescape them:

string.gsub(/\u(....)/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}

Is it correct? It seems to work in my tests, but can anyone more knowledgeable find a problem with it?

1 answer

Your regex /\u(....)/ has some problems.

First, \u does not do what you think it does. In 1.9 it raises an error, and in 1.8 it matches only a single u rather than the two-character sequence \u that you are looking for. You must use /\\u/ to match a literal \u.

Second, your group (....) is too permissive: it matches any four characters, not just the four hexadecimal digits you want. In 1.9 you can use (\h{4}) (four hexadecimal digits), but in 1.8 you need ([\da-fA-F]{4}) because \h was added in 1.9.
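To make the difference concrete, here is a small sketch comparing the two groups on a made-up input. The sample text and the names loose and strict are only illustrative:

```ruby
# Single quotes keep the backslash literal, so the text really contains \u.
text = 'bad \uZZZZ good \u003C'

loose  = /\\u(....)/            # any four characters after a literal \u
strict = /\\u([\da-fA-F]{4})/   # exactly four hexadecimal digits

# scan returns one array per match (one element per group), so flatten it.
loose_hits  = text.scan(loose).flatten
strict_hits = text.scan(strict).flatten

p loose_hits   # the loose group also captures the non-hex "ZZZZ"
p strict_hits  # the strict group captures only "003C"
```

With the loose group, the block would be asked to convert "ZZZZ" as if it were a hexadecimal number; the strict group leaves that sequence untouched.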

So, if you want your regular expression to work in both 1.8 and 1.9, you should use /\\u([\da-fA-F]{4})/ . This gives you the following in 1.8 and 1.9:

>> s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
=> "Where is \\u03bc pancakes \\u03BD house? And u1123!"
>> s.gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
=> "Where is μ pancakes ν house? And u1123!"

Using pack and unpack to convert the hexadecimal number into a Unicode character is probably good enough, but there may be better ways.
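One possible alternative, offered as a sketch rather than the definitive way, is to parse the captured digits with String#hex and pack the resulting code point directly. The helper name unescape_unicode is made up for this example:

```ruby
# Hypothetical helper: replace each \uXXXX escape with the character it names.
def unescape_unicode(str)
  # $1 holds the four hex digits; String#hex turns them into an Integer,
  # and pack("U") encodes that code point as a UTF-8 character.
  str.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack("U") }
end

s = 'Where is \u03bc pancakes \u03BD house? And u1123!'
result = unescape_unicode(s)
puts result
```

Like the pack/unpack version, this handles each four-digit escape on its own, so it does not combine surrogate pairs for characters outside the Basic Multilingual Plane.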
