How to handle Unicode values in JSON strings?

I am writing a JSON parser in C++ and have run into a problem while parsing JSON strings:

The JSON specification states that JSON strings can contain Unicode characters in the form:

"here comes a unicode character: \u05d9 !" 

My JSON parser maps JSON strings to std::string, so that one character of the JSON string becomes one character of the std::string. For these Unicode escapes, however, I really don't know what to do:

Should I just put the raw byte values in my std::string like this:

 std::string mystr;
 mystr.push_back('\x05');
 mystr.push_back('\xd9');

Or should I interpret the two bytes with a library like iconv and store the UTF-8 encoded result in my string instead?

Or should I use std::wstring to store all characters? What then on *NIX systems, where wchar_t is 4 bytes long?

I feel that something is wrong with each of these options, but I can't tell what. What should I do in this situation?

+6
2 answers

After some digging, and thanks to the comments from H2CO3 and Philips, I finally figured out how this should work:

Reading RFC 4627, Section 3, Encoding:

  3. Encoding

    JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

    Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

      00 00 00 xx  UTF-32BE
      00 xx 00 xx  UTF-16BE
      xx 00 00 00  UTF-32LE
      xx 00 xx 00  UTF-16LE
      xx xx xx xx  UTF-8

So a JSON octet stream can be encoded in UTF-8, UTF-16, or UTF-32 (in either BE or LE byte order for the latter two).
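The table above can be turned into a small check on the first four octets. This is only a sketch under some assumptions: the names JsonEncoding and detect_json_encoding are made up for illustration, the input carries no byte order mark, and anything shorter than four octets is simply treated as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names, chosen for this example only.
enum class JsonEncoding { Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Classify a JSON octet stream by the pattern of zero octets in its first
// four octets, following the table in RFC 4627, Section 3.
// Assumptions: no byte order mark; short inputs fall back to UTF-8.
JsonEncoding detect_json_encoding(const std::string& octets) {
    if (octets.size() < 4) return JsonEncoding::Utf8;
    const bool z0 = octets[0] == '\0';
    const bool z1 = octets[1] == '\0';
    const bool z2 = octets[2] == '\0';
    const bool z3 = octets[3] == '\0';
    if (z0 && z1 && z2 && !z3) return JsonEncoding::Utf32BE;   // 00 00 00 xx
    if (z0 && !z1 && z2 && !z3) return JsonEncoding::Utf16BE;  // 00 xx 00 xx
    if (!z0 && z1 && z2 && z3) return JsonEncoding::Utf32LE;   // xx 00 00 00
    if (!z0 && z1 && !z2 && z3) return JsonEncoding::Utf16LE;  // xx 00 xx 00
    return JsonEncoding::Utf8;                                 // xx xx xx xx
}
```

A parser could call this once on the raw input and transcode everything to UTF-8 before tokenizing, so the rest of the code only ever deals with one encoding.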

Once this is clear, Section 2.5, Strings, explains how to handle these \uXXXX values in JSON strings:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A through F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

It also gives a more detailed explanation for characters outside the Basic Multilingual Plane:

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
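To make the surrogate pair rule concrete, here is a minimal sketch of how two consecutive \uXXXX values could be combined into one code point. The helper names are hypothetical, and the caller is assumed to have already parsed the two 16-bit values and to reject unpaired surrogates itself.

```cpp
#include <cstdint>

// Hypothetical helpers, named for this example only.

// True for the first half of a UTF-16 surrogate pair (0xD800..0xDBFF).
bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True for the second half of a UTF-16 surrogate pair (0xDC00..0xDFFF).
bool is_low_surrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid high/low surrogate pair into a code point above U+FFFF.
// Assumption: the caller has checked both halves with the predicates above.
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

With the RFC's own example, combine_surrogates(0xD834, 0xDD1E) gives 0x1D11E, the G clef character.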

Hope this helps.

+11

If I were you, I would use std::string to store UTF-8 and UTF-8 only. If the incoming JSON text is UTF-8 and does not contain any \uXXXX sequences, its strings can be copied into std::string byte for byte, without any conversion.

When you parse a \uXXXX sequence, you can simply decode it and convert it to UTF-8, effectively treating it as if the real UTF-8 character had been in its place. This is what most JSON parsers do (libjson for sure).
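A rough sketch of that decoding step might look like the following. Both function names are invented for this example, and it assumes the code point passed to append_utf8 is a valid Unicode scalar value (surrogate pairs already combined, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: read exactly four hex digits starting at s[pos],
// i.e. the XXXX part of a \uXXXX escape. Returns false on malformed input.
bool parse_hex4(const std::string& s, std::size_t pos, std::uint32_t& value) {
    if (pos + 4 > s.size()) return false;
    value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        std::uint32_t digit;
        if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;
        value = (value << 4) | digit;
    }
    return true;
}

// Hypothetical helper: append the UTF-8 encoding of code point cp to out.
// Assumption: cp is a valid scalar value (not a lone surrogate, <= 0x10FFFF).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

For the \u05d9 from the question, parse_hex4 yields 0x05D9 and append_utf8 then appends the two bytes 0xD7 0x99, which is what the std::string should hold instead of the raw 0x05 and 0xD9.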

Of course, with this approach, reading JSON that contains \uXXXX sequences and immediately writing it back out with your library will probably lose the \uXXXX sequences and replace them with their real UTF-8 representations, but who really cares? Ultimately, the net result is still the same.

+2