How to decode an escaped Unicode string?

I'm not sure what this is called, so I'm having trouble searching for it. How can I decode a Unicode-escaped string such as http\u00253A\u00252F\u00252Fexample.com into http://example.com using JavaScript? I tried unescape, decodeURI and decodeURIComponent, so I think the only option left is string replacement.

EDIT: The string is not typed in as a literal; it is a substring taken from another piece of code. So to solve the problem, you have to start with something like this:

 var s = 'http\\u00253A\\u00252F\\u00252Fexample.com'; 

I hope this shows why unescape() does not work.
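A minimal sketch of why the built-in decoders seem to do nothing here: the string holds a literal backslash followed by u0025 rather than a % character, so the URI decoders find nothing to decode. The variable `s` comes from the question; the other names are illustrative.

```javascript
// The source literal '\\u0025' produces a real backslash followed by "u0025".
var s = 'http\\u00253A\\u00252F\\u00252Fexample.com';

// No "%" is present yet, so the URI decoders return the input unchanged.
var viaDecodeURI = decodeURI(s);
var viaDecodeURIComponent = decodeURIComponent(s);

console.log(s.indexOf('%'));              // -1
console.log(viaDecodeURI === s);          // true
console.log(viaDecodeURIComponent === s); // true
```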

+74
javascript decode urldecode
Oct. 25
6 answers

Original answer:

 unescape(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
 > 'http://example.com'

You can offload all the work to JSON.parse.

Edit (2017-10-12) :

@MechaLynx and @Kevin-Weber note that unescape() is deprecated outside browser environments and does not exist in TypeScript. decodeURIComponent is a drop-in replacement. For broader compatibility, use the following instead:

 decodeURIComponent(JSON.parse('"http\\u00253A\\u00252F\\u00252Fexample.com"'));
 > 'http://example.com'
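The one-liner performs two separate decoding steps. The sketch below splits them apart to show the intermediate value; the variable names are illustrative and not part of the original answer.

```javascript
// Raw input: contains literal backslashes, as in the question.
var raw = 'http\\u00253A\\u00252F\\u00252Fexample.com';

// Step 1: wrap the text in double quotes so it reads as a JSON string
// literal; JSON.parse then resolves each \u0025 escape to "%".
var percentEncoded = JSON.parse('"' + raw + '"');

// Step 2: resolve the percent-encoding.
var decoded = decodeURIComponent(percentEncoded);

console.log(percentEncoded); // http%3A%2F%2Fexample.com
console.log(decoded);        // http://example.com
```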
+89
Oct 13

UPDATE: Please note that this solution is meant for older browsers or non-browser platforms and is kept for instructional purposes. Please refer to @radicand's answer for a more current approach.




This is a Unicode-escaped string. The string was first percent-escaped, and then each % character was written as a Unicode escape (\u0025). To convert it back to normal:

 var x = "http\\u00253A\\u00252F\\u00252Fexample.com";
 var r = /\\u([\d\w]{4})/gi;
 x = x.replace(r, function (match, grp) {
     return String.fromCharCode(parseInt(grp, 16));
 });
 console.log(x); // http%3A%2F%2Fexample.com
 x = unescape(x);
 console.log(x); // http://example.com

To explain: I use a regular expression to look for \u0025. However, since the replacement only needs part of that match, I use parentheses to isolate the part I am going to reuse, 0025. This isolated part is called a group.

The gi part at the end of the expression means that it should match all instances in the string, not just the first one, and that matching should be case insensitive. This may seem unnecessary given the example, but it adds versatility.

Now, to convert from one string to the next, I need to perform a few steps on each group of each match, and I cannot do that by simply transforming the string. Helpfully, String.replace can accept a function, which is executed for each match. The return value of that function replaces the match itself in the string.

I use the second parameter that this function accepts, which is the group I need, parse it as a base-16 number and convert it to the corresponding character with String.fromCharCode. Then I use the built-in unescape function to decode the string to its proper form.
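As a variation on the steps described above, here is a sketch that restricts the group to hexadecimal digits and uses decodeURIComponent instead of the deprecated unescape. The helper name `decodeUnicodeEscapes` is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: replaces each \uXXXX escape with its character.
function decodeUnicodeEscapes(input) {
  // Only four hexadecimal digits after \u are accepted as a group.
  return input.replace(/\\u([0-9a-f]{4})/gi, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

var escaped = 'http\\u00253A\\u00252F\\u00252Fexample.com';
var percentEncoded = decodeUnicodeEscapes(escaped);
var url = decodeURIComponent(percentEncoded);

console.log(percentEncoded); // http%3A%2F%2Fexample.com
console.log(url);            // http://example.com
```

Narrowing the character class means a sequence such as \uzzzz is left untouched instead of being passed to parseInt.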

+107
Oct. 25 2018-11-11T00:

Note that using unescape() is deprecated and does not work with the TypeScript compiler, for example.

Based on radicand's answer and the comments section below, here's an updated solution:

 var string = "http\\u00253A\\u00252F\\u00252Fexample.com";
 decodeURIComponent(JSON.parse('"' + string.replace(/\"/g, '\\"') + '"'));

http://example.com
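To illustrate why the replace() step matters, the sketch below wraps the same expression in a hypothetical helper, `decodeEscaped`, and feeds it an input that contains a double quote. The helper name and the sample inputs are assumptions made for illustration.

```javascript
// Hypothetical helper wrapping the expression from this answer.
function decodeEscaped(input) {
  // Escape embedded double quotes so the wrapped text stays one JSON string.
  var jsonLiteral = '"' + input.replace(/"/g, '\\"') + '"';
  return decodeURIComponent(JSON.parse(jsonLiteral));
}

var plain = decodeEscaped('http\\u00253A\\u00252F\\u00252Fexample.com');
var quoted = decodeEscaped('say "hi" \\u0041');

// Without the replace() step, the embedded quote ends the JSON string
// early and JSON.parse throws a SyntaxError.
var threwWithoutEscaping = false;
try {
  JSON.parse('"' + 'say "hi" \\u0041' + '"');
} catch (e) {
  threwWithoutEscaping = e instanceof SyntaxError;
}

console.log(plain);                // http://example.com
console.log(quoted);               // say "hi" A
console.log(threwWithoutEscaping); // true
```

One limitation of this sketch: an input that already contains a backslash-escaped quote is escaped a second time and makes JSON.parse throw.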

+16
Nov 03 '16 at 20:33

Have a look at this page: http://www.rishida.net/tools/conversion/

Paste your code in the upper text box (replace the double backslashes with single ones first).

The code is open source: http://www.rishida.net/tools/conversion/conversionfunctions.js

+4
Oct 25 2018-11-11T00:

I don't have enough reputation to put this in a comment on existing answers:

unescape is only deprecated for working with URIs (or any encoded UTF-8), which is probably what most people need. encodeURIComponent converts a JS string to escaped UTF-8, and decodeURIComponent only works on escaped UTF-8 bytes. It throws an error for something like decodeURIComponent('%a9'); // error, because extended ASCII is not valid UTF-8 (even though it is still a Unicode value), whereas unescape('%a9'); // © works. So you need to know your data when using decodeURIComponent.

decodeURIComponent will not work on "%C2" or on any lone byte over 0x7f, because in UTF-8 such a byte indicates part of a multi-byte sequence. However, decodeURIComponent("%C2%A9") // gives you ©. unescape would not work properly on that input (it returns Â©), and it would not throw an error either, so unescape can lead to buggy code if you do not know your data.
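A small sketch of the behavior described above. The helper `tryDecode` is hypothetical: it returns null instead of letting the URIError propagate, so the failing cases can be printed side by side.

```javascript
// Hypothetical helper: returns null instead of throwing on malformed input.
function tryDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (e) {
    if (e instanceof URIError) {
      return null;
    }
    throw e;
  }
}

console.log(tryDecode('%C2%A9')); // © (valid two-byte UTF-8 sequence)
console.log(tryDecode('%a9'));    // null (0xA9 alone is not valid UTF-8)
console.log(tryDecode('%C2'));    // null (lead byte without a continuation byte)

// unescape is deprecated; guard in case the environment lacks it.
if (typeof unescape === 'function') {
  console.log(unescape('%a9'));    // ©
  console.log(unescape('%C2%A9')); // Â© (each byte becomes one character)
}
```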

+2
Mar 15 '18 at 22:15

Using JSON.parse for this has significant drawbacks that you should be aware of:

  • You must wrap the string in double quotation marks
  • Many characters are not supported and must be escaped themselves. For example, passing any of the following to JSON.parse (after wrapping them in double quotes) will result in an error, even though they are all valid: \\n , \n , \\0 , a"a
  • It does not support hexadecimal escapes: \\x45
  • It does not support Unicode code point escapes: \\u{045}

There are other caveats as well. Essentially, using JSON.parse for this purpose is a hack and does not always work the way you might expect. You should stick to using the JSON library for handling JSON, not for string operations.
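Some of these limitations can be checked directly. The sketch below uses a hypothetical helper, `parsesAsJsonString`, which reports whether JSON.parse accepts a piece of text once it is wrapped in double quotes. The inputs are written as JavaScript string literals, so '\\x45' is the four-character text \x45.

```javascript
// Hypothetical helper: true if the quoted text is a valid JSON string.
function parsesAsJsonString(text) {
  try {
    JSON.parse('"' + text + '"');
    return true;
  } catch (e) {
    return false;
  }
}

console.log(parsesAsJsonString('\\u0045')); // true  (\uXXXX is valid JSON)
console.log(parsesAsJsonString('\\x45'));   // false (JSON has no \x escapes)
console.log(parsesAsJsonString('\\u{45}')); // false (JSON has no \u{...} escapes)
console.log(parsesAsJsonString('\\0'));     // false (JSON has no \0 escape)
console.log(parsesAsJsonString('a"a'));     // false (unescaped double quote)
console.log(parsesAsJsonString('\n'));      // false (raw line break in a string)
```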




I recently ran into this problem myself and wanted a robust decoder, so I ended up writing one. It is complete and thoroughly tested, and it is available here: https://github.com/iansan5653/unraw . It follows the JavaScript standard as closely as possible.

Explanation:

The source is about 250 lines, so I won't include it all here, but essentially it uses the following regular expression to find all escape sequences, then parses them using parseInt(string, 16) to decode the base-16 numbers and String.fromCodePoint(number) to get the corresponding character:

 /\\(?:(\\)|x([\s\S]{0,2})|u(\{[^}]*\}?)|u([\s\S]{4})\\u([^{][\s\S]{0,3})|u([\s\S]{0,4})|([0-3]?[0-7]{1,2})|([\s\S])|$)/g 

Commented (NOTE: This regular expression matches all escape sequences, including invalid ones. If the string would throw an error in JS, it throws an error in my library [i.e. '\x!!' will fail]):

 /
 \\                                 # All escape sequences start with a backslash
 (?:                                # Starts a group of 'or' statements
 (\\)                               # If a second backslash is encountered, stop there (it's an escaped slash)
 |                                  # or
 x([\s\S]{0,2})                     # Match valid hexadecimal sequences
 |                                  # or
 u(\{[^}]*\}?)                      # Match valid code point sequences
 |                                  # or
 u([\s\S]{4})\\u([^{][\s\S]{0,3})   # Match surrogate code points which get parsed together
 |                                  # or
 u([\s\S]{0,4})                     # Match non-surrogate Unicode sequences
 |                                  # or
 ([0-3]?[0-7]{1,2})                 # Match deprecated octal sequences
 |                                  # or
 ([\s\S])                           # Match anything else ('.' does not match newlines)
 |                                  # or
 $                                  # Match the end of the string
 )                                  # End the group of 'or' statements
 /g                                 # Match as many instances as there are
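The general technique (find escapes with a regular expression, then decode them with parseInt and String.fromCodePoint) can be sketched in a much smaller form. This is not the unraw implementation: the function name `decodeBasicEscapes` and its reduced pattern are assumptions made for illustration, and it only handles \\, \xHH, \uHHHH and \u{H...}, leaving everything else untouched.

```javascript
// Simplified sketch; NOT the unraw implementation.
function decodeBasicEscapes(raw) {
  // Groups: escaped backslash, \xHH, \u{H...}, \uHHHH.
  var pattern = /\\(?:(\\)|x([0-9a-fA-F]{2})|u\{([0-9a-fA-F]{1,6})\}|u([0-9a-fA-F]{4}))/g;
  return raw.replace(pattern, function (match, backslash, hex, codePoint, unicode) {
    if (backslash !== undefined) {
      return '\\';
    }
    // Exactly one of the three hexadecimal groups is defined per match.
    var digits = hex || codePoint || unicode;
    // String.fromCodePoint throws a RangeError above 0x10FFFF.
    return String.fromCodePoint(parseInt(digits, 16));
  });
}

console.log(decodeBasicEscapes('http\\u00253A\\u00252F\\u00252Fexample.com'));
// http%3A%2F%2Fexample.com
console.log(decodeBasicEscapes('\\x45 \\u{45} \\u0045')); // E E E
```

Unlike the full pattern above, this sketch silently skips malformed sequences such as \x!! instead of reporting an error.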

Example

Using this library:

 import unraw from "unraw";

 let step1 = unraw('http\\u00253A\\u00252F\\u00252Fexample.com');
 // yields "http%3A%2F%2Fexample.com"

 // Then you can use decodeURIComponent to further decode it:
 let step2 = decodeURIComponent(step1);
 // yields http://example.com
0
Aug 19 '19 at 16:25


