Unicode URL Decoding

The usual method for URL-encoding a Unicode character is to break it into two %HH codes ( \u4161 => %41%61 ).

But how is Unicode distinguished when decoding? How do you know whether %41%61 means \u4161 or \x41\x61 ("Aa")?

Are there 8-bit characters whose encoding must be preceded by %00?

Or is it assumed that Unicode characters are simply lost/split?

3 answers

According to Wikipedia:

Current standard

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

The current specification does not dictate what to do with decoded character data. For example, in computers, character data manifests in encoded form at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to each URI scheme's specification to account for this possibility and require one or the other, but in practice few, if any, actually do.

Custom Implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is the Unicode code point represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.

So it looks like it depends entirely on whoever wrote the decode method... Aren't standards fun?
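Standard decoders generally do not understand the non-standard %uxxxx form, so handling it takes a custom step. A minimal Python sketch (the `unescape_percent_u` helper is a hypothetical name, not a standard API), contrasted with the standard UTF-8 behavior of `urllib.parse.unquote`:

```python
import re
from urllib.parse import unquote

def unescape_percent_u(s):
    """Decode non-standard %uXXXX escapes (hypothetical helper)."""
    return re.sub(r'%u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)), s)

# Non-standard output, as produced by JavaScript's escape():
unescape_percent_u('%u4161')   # -> '\u4161'

# Standard RFC 3986 / encodeURI output (UTF-8 octets):
unquote('%E4%85%A1')           # -> '\u4161' as well
```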


What I've always done is UTF-8 encode the Unicode string first, turning it into a series of 8-bit bytes, before escaping any of them as %HH.
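In Python, for example, `urllib.parse.quote` follows exactly this approach (a sketch, relying on its default UTF-8 encoding):

```python
from urllib.parse import quote, unquote

# \u4161 becomes the UTF-8 bytes E4 85 A1, each escaped as %HH:
encoded = quote('\u4161')
print(encoded)            # -> %E4%85%A1

# The ambiguity from the question disappears: %41%61 is just the
# bytes 0x41 0x61, i.e. "Aa", never \u4161.
print(unquote('%41%61'))  # -> Aa
```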

PS - I can only hope that the non-standard implementations (%uxxxx) are few and far between.


Since URIs were introduced before Unicode was around, or at least in wide use, I assume this is a very implementation-specific issue. UTF-8 encoding your text and then escaping it normally sounds like the best idea, since it is fully backward compatible with any ASCII/ANSI systems in place, although you may get the odd weird character or two.

At the other end, to decode, you unescape the text and interpret the result as a UTF-8 string. If someone on an older system sends you ASCII/ANSI data, no harm is done, since ASCII is (almost) already valid UTF-8.
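To illustrate with Python's `urllib.parse.unquote` (a sketch; its defaults are `encoding='utf-8'` and `errors='replace'`):

```python
from urllib.parse import unquote

# ASCII percent-encoded data is already valid UTF-8, so it decodes cleanly:
assert unquote('%41%61') == 'Aa'

# A lone Latin-1 ("ANSI") byte is NOT valid UTF-8; the default
# errors='replace' substitutes U+FFFD instead of raising:
assert unquote('%E9') == '\ufffd'

# If you know the sender used Latin-1, you can say so explicitly:
assert unquote('%E9', encoding='latin-1') == '\u00e9'
```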

