Fatal error: "high- and low-surrogate code points are not valid Unicode scalar values"

Sometimes when initializing a UnicodeScalar with a value such as 57292, the following error occurs:

 fatal error: high- and low-surrogate code points are not valid Unicode scalar values 

What does this error mean, why does it happen, and how can I prevent it in the future?

1 answer

Background: UTF-16 represents a sequence of Unicode characters ("code points") as a sequence of 16-bit "code units". For characters whose scalar values fit in 16 bits (i.e., U+0000 through U+FFFF), the code unit has the same value as the character; but for characters outside that range (U+10000 through U+10FFFF), UTF-16 must use two code units. To make this possible, Unicode reserves a range of code points (U+D800 through U+DFFF) as "surrogates", which can never be used as characters; UTF-16 then uses a pair of these surrogates to represent a code point outside the 16-bit range. ("High" and "low" refer to the surrogates that serve as the first and second code unit of such a pair, respectively. Each surrogate is either a high surrogate or a low surrogate, never both: experience with earlier character sets showed that it is very useful to always be able to tell where one character ends and the next begins.)
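As a quick illustration (a sketch in Swift, since the question concerns UnicodeScalar): U+1F600 ("😀") lies outside the 16-bit range, so its UTF-16 encoding is a high surrogate followed by a low surrogate.

```swift
let face = "😀"                       // U+1F600, outside the 16-bit range
let units = Array(face.utf16)         // its UTF-16 code units

// Two code units: high surrogate 0xD83D, then low surrogate 0xDE00.
print(units.map { String($0, radix: 16) }) // ["d83d", "de00"]
```

Neither 0xD83D nor 0xDE00 is a character by itself; only the pair together denotes U+1F600.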

So the problem you are seeing is that you are trying to create a UnicodeScalar with the value 57292 (U+DFCC), which the Unicode standard reserves as a surrogate and therefore excludes from the set of Unicode scalar values. U+DFCC does not exist as a character in its own right; it is merely a surrogate, one half of the encoding of a scalar that does exist.

To prevent this problem, stick to scalar values that actually exist: U+0000 through U+D7FF and U+E000 through U+10FFFF.
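One way to do this safely (a sketch, assuming Swift 3 or later, where the UInt32 initializer of UnicodeScalar is failable rather than trapping with the fatal error above; the helper `scalar(from:)` is hypothetical):

```swift
// In Swift 3+, UnicodeScalar(_: UInt32) returns nil for surrogate
// values (U+D800...U+DFFF) and for values above U+10FFFF,
// instead of crashing with a fatal error.
func scalar(from value: UInt32) -> UnicodeScalar? {
    return UnicodeScalar(value)
}

scalar(from: 0x41)     // Optional("A"): a valid scalar
scalar(from: 57292)    // nil: 0xDFCC is a low surrogate
scalar(from: 0x110000) // nil: beyond the Unicode range
```

Unwrapping the optional (e.g. with `guard let`) lets you reject invalid input instead of crashing at runtime.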

