C# and UTF-16 Symbols

Question


Is it possible in C# to use UTF-32 characters outside plane 0 as a char?

    string s = ""; // valid
    char c = ''; // generates a compiler error ("Too many characters in character literal")

And in s it is represented by two characters, not one.

Edit: I mean, is there a character or string type with full Unicode support, holding one UTF-32 or UTF-8 character per element? For example, what if I want a for loop over the UTF-32 characters (possibly outside plane 0) in a string?

+7

c# unicode

Dutow Mar 30 '09 at 12:51


3 answers

I only knew this problem from Java and checked the char documentation before replying; indeed, the behavior is pretty much the same in .NET / C# and Java.

char is defined as 16 bits wide and definitely cannot hold anything outside plane 0. Only String / string can handle these characters. In a char array, such a character is represented as two surrogate characters.
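A minimal sketch of what "two surrogate characters" means in practice. The class name SurrogateDemo and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are assumptions made purely for illustration; only char.ConvertFromUtf32, char.ConvertToUtf32, char.IsHighSurrogate and char.IsLowSurrogate are real BCL calls.

```csharp
using System;

public static class SurrogateDemo
{
    // U+1D11E lies outside plane 0; the code point is an arbitrary example.
    public const int GClef = 0x1D11E;

    // Builds a string holding exactly one supplementary character.
    public static string Build()
    {
        return char.ConvertFromUtf32(GClef);
    }

    public static void Show()
    {
        string s = Build();

        // Length counts UTF-16 code units, so one supplementary character gives 2.
        Console.WriteLine(s.Length);

        // The two chars are a high surrogate followed by a low surrogate.
        Console.WriteLine(char.IsHighSurrogate(s[0]));
        Console.WriteLine(char.IsLowSurrogate(s[1]));

        // Recombine the pair into the original code point (prints 1D11E).
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));
    }
}
```

Calling SurrogateDemo.Show() prints the length, the two surrogate checks and the recombined code point, which is why no single char can be assigned the whole character.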

+4

Joachim Sauer Mar 30 '09 at 12:58


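C#'s System.String handles UTF-32 code points just fine (it stores them as UTF-16 surrogate pairs), but you cannot iterate over the string as if it were a System.Char array, or through IEnumerable, and expect to get whole characters.

For example: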

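    // Iterating through a string WITHOUT UTF-32 support:
    // every 16-bit char is classified on its own.
    for (int i = 0; i < sample.Length; ++i)
    {
        if (Char.IsDigit(sample[i]))
        {
            Console.WriteLine("IsDigit");
        }
        else if (Char.IsLetter(sample[i]))
        {
            Console.WriteLine("IsLetter");
        }
    }

    // Iterating through a string WITH UTF-32 support:
    // the (string, int) overloads look at a whole surrogate pair.
    for (int i = 0; i < sample.Length; ++i)
    {
        if (Char.IsDigit(sample, i))
        {
            Console.WriteLine("IsDigit");
        }
        else if (Char.IsLetter(sample, i))
        {
            Console.WriteLine("IsLetter");
        }

        // Skip the low surrogate only when a complete pair starts at i;
        // a lone surrogate must not swallow the char that follows it.
        if (Char.IsSurrogatePair(sample, i))
        {
            ++i;
        }
    }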

Note the slight difference in the calls to Char.IsDigit and Char.IsLetter: the second loop passes the string and an index instead of a single char. Note also that String.Length is always the number of 16-bit chars, not the number of characters in the UTF-32 sense.

Off topic, but UTF-32 support is completely unnecessary for an application that processes international languages, unless you have a specific business case for an obscure historical or technical language.
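The loop over UTF-32 characters that the question asks for could be sketched as follows. CodePointWalker, CodePoints and Count are hypothetical helper names, not BCL members; the sketch assumes that a lone surrogate should simply be passed through as its own value.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointWalker
{
    // Hypothetical helper: yields one int per Unicode code point in text.
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; ++i)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Combine the high and low surrogate into one code point.
                yield return char.ConvertToUtf32(text, i);
                ++i; // step over the low surrogate
            }
            else
            {
                // Plane 0 char, or a lone surrogate passed through unchanged.
                yield return text[i];
            }
        }
    }

    // Number of code points, as opposed to String.Length (UTF-16 code units).
    public static int Count(string text)
    {
        int n = 0;
        foreach (int codePoint in CodePoints(text))
        {
            n++;
        }
        return n;
    }
}
```

With this helper, foreach (int cp in CodePointWalker.CodePoints(sample)) visits each character once, while sample.Length keeps reporting the larger code-unit count whenever surrogate pairs are present.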

+3

Luke tigaris May 09 '09 at 15:13


Emperor XLII · Accepted Answer · 2009-05-11T23:29:14+0000

The string class represents text encoded as UTF-16, and each char in a string is a single UTF-16 code unit.

Although there is no BCL type that represents a single Unicode code point, characters outside plane 0 are supported through method overloads that take a string and an index instead of a plain char. For example, the static method GetUnicodeCategory(char) on System.Globalization.CharUnicodeInfo has a corresponding GetUnicodeCategory(string, int) that recognizes either a simple character or a surrogate pair starting at the specified index.
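A short sketch of the difference between the two overloads. The class name CategoryDemo and the sample character U+10400 (DESERET CAPITAL LETTER LONG I) are illustrative assumptions; the two GetUnicodeCategory overloads are the ones named above.

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    // U+10400 is an uppercase letter outside plane 0 (arbitrary example).
    public static readonly string Sample = char.ConvertFromUtf32(0x10400);

    // char overload: sees only the first 16-bit code unit.
    public static UnicodeCategory ByChar()
    {
        return CharUnicodeInfo.GetUnicodeCategory(Sample[0]);
    }

    // (string, int) overload: sees the whole surrogate pair at index 0.
    public static UnicodeCategory ByStringIndex()
    {
        return CharUnicodeInfo.GetUnicodeCategory(Sample, 0);
    }

    public static void Show()
    {
        Console.WriteLine(ByChar());        // Surrogate
        Console.WriteLine(ByStringIndex()); // UppercaseLetter
    }
}
```

The char overload can only report that it was handed half of a pair, whereas the string and index overload classifies the actual character.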

To iterate through the text elements in a string, you can use the methods on the System.Globalization.StringInfo class. Here, a "text element" corresponds to a single character as displayed on screen. This means that simple characters ("a"), combining character sequences ("a\u0304\u0308" = "ā̈"), and surrogate pairs ("\uD950\uDF21") are each treated as a single text element.
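As a rough illustration, the three kinds of text element listed above can be enumerated with StringInfo.GetTextElementEnumerator. The wrapper class TextElementDemo and its TextElements method are hypothetical names introduced for this sketch; the sample string reuses the literals from the paragraph above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: splits text into its text elements.
    public static List<string> TextElements(string text)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            // Each element may span several chars (combining marks, surrogate pairs).
            result.Add(e.GetTextElement());
        }
        return result;
    }

    public static void Show()
    {
        // Simple character, combining sequence, surrogate pair.
        string sample = "a" + "a\u0304\u0308" + "\uD950\uDF21";

        Console.WriteLine(sample.Length); // 6 UTF-16 code units
        foreach (string element in TextElements(sample))
        {
            // Report how many chars each text element occupies.
            Console.WriteLine(element.Length);
        }
    }
}
```

For this sample the enumerator yields three elements of 1, 3 and 2 chars, and new StringInfo(sample).LengthInTextElements gives the same count of three.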

In particular, the static method
