Defining a 4-byte UTF-16 character in a string

Question


I read the question about UTF-8, UTF-16, and UCS-2, and almost all the answers state that UCS-2 is obsolete and that C# uses UTF-16.

However, all my attempts to create the 4-byte character U+1D11E in C# failed, so I now suspect that C# only supports the UCS-2 subset of UTF-16.

Here are my attempts:

 string s = "\u1D11E"; // gives the 2 character string "ᴑE", because \u1D11 is ᴑ
 string s = (char) 0x1D11E; // won't compile because of an overflow
 string s = Encoding.Unicode.GetString(new byte[] {0xD8, 0x34, 0xDD, 0x1E}); // gives 㓘ờ

Are C# strings really UTF-16, or are they actually UCS-2? If they are UTF-16, how do I get the treble clef into my C# string?

+6

c# encoding unicode character-encoding utf-16

Thomas Weller Jan 01 '14 at 23:38


3 answers

C# definitely uses UTF-16. The correct way to define characters outside the range U+0000 to U+FFFF is the \U escape sequence, which takes exactly 8 hexadecimal digits:

 string s = "\U0001D11E";

If you use \u1D11E, it is interpreted as the character U+1D11 followed by the letter E, because \u takes exactly 4 hexadecimal digits.

When using these characters, keep in mind that the String.Length property and most string methods work with UTF-16 code units, not Unicode characters. From the MSDN documentation:
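As a rough illustration of the point above, the following sketch builds U+1D11E in three equivalent ways and prints the two UTF-16 code units it occupies. It assumes C# 9 or later so that top-level statements compile; in an older project the same lines would go inside a Main method. The variable names are made up for the example.

```csharp
using System;

// Three ways to put U+1D11E (the treble clef) into a string.
string viaEscape = "\U0001D11E";                    // 8-digit \U escape
string viaConvert = char.ConvertFromUtf32(0x1D11E); // library call that builds the surrogate pair
string viaSurrogates = "\uD834\uDD1E";              // the surrogate pair written out by hand

Console.WriteLine(viaEscape == viaConvert);    // True
Console.WriteLine(viaEscape == viaSurrogates); // True

// The single character is stored as two UTF-16 code units.
Console.WriteLine(viaEscape.Length);                   // 2
Console.WriteLine(((int)viaEscape[0]).ToString("X4")); // D834
Console.WriteLine(((int)viaEscape[1]).ToString("X4")); // DD1E
```

Of the three, char.ConvertFromUtf32 is probably the most convenient when the code point is only known at run time, since the \U escape works only in literals.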

The Length property returns the number of Char objects in this instance, and not the number of Unicode characters. The reason is that a Unicode character can be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
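To make the difference between Char objects and Unicode characters concrete, here is a small sketch that counts the same string three ways. It assumes C# 9 or later (top-level statements); the sample string and variable names are invented for the example.

```csharp
using System;
using System.Globalization;

string s = "a\U0001D11Eb"; // 'a', treble clef, 'b'

int codeUnits = s.Length;                                  // counts UTF-16 code units
int textElements = new StringInfo(s).LengthInTextElements; // counts text elements

// Walk the string code point by code point, skipping the low surrogate
// whenever a surrogate pair starts at the current index.
int codePoints = 0;
for (int i = 0; i < s.Length; i++)
{
    int cp = char.ConvertToUtf32(s, i);
    if (char.IsSurrogatePair(s, i)) i++;
    codePoints++;
    Console.WriteLine("U+" + cp.ToString("X4")); // U+0061, U+1D11E, U+0062
}

Console.WriteLine(codeUnits);    // 4
Console.WriteLine(textElements); // 3
Console.WriteLine(codePoints);   // 3
```

Text elements and code points happen to agree here; they can differ for strings that contain combining marks, so which count is appropriate depends on what the code is measuring.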

+5

Joni Jan 01 '14 at 23:48


According to the C# specification, code points that need more than 4 hexadecimal digits are written with \U (uppercase U) followed by 8 hexadecimal digits. Once the character is correctly encoded in the string, it can be exported correctly using any Unicode encoding:

 string s = "\U0001D11E";
 foreach (var b in Encoding.UTF32.GetBytes(s))
     Console.WriteLine(b.ToString("x2"));
 Console.WriteLine();
 foreach (var b in Encoding.Unicode.GetBytes(s))
     Console.WriteLine(b.ToString("x2"));

 > 1e
 > d1
 > 01
 > 00
 >
 > 34
 > d8
 > 1e
 > dd

+2

Joachim Isaksson Jan 01 '14 at 23:59


Hans Passant · Accepted Answer · 2014-01-02T00:00:20+0000

Use capital U instead:

  string s = "\U0001D11E";

And you overlooked the fact that most machines are little-endian:

  string t = Encoding.Unicode.GetString(new byte[] { 0x34, 0xD8, 0x1E, 0xDD });
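If the bytes really do arrive in big-endian order, as in the question, another option is to decode them with Encoding.BigEndianUnicode rather than reordering them by hand. The sketch below contrasts the two decoders on the question's byte array; it assumes C# 9 or later (top-level statements), and the variable names are made up for the example.

```csharp
using System;
using System.Text;

// The byte order used in the question: high byte first (big-endian).
byte[] bigEndianBytes = { 0xD8, 0x34, 0xDD, 0x1E };

// Encoding.Unicode is little-endian UTF-16, so it reads the pairs as 0x34D8 and 0x1EDD.
string wrong = Encoding.Unicode.GetString(bigEndianBytes);

// Encoding.BigEndianUnicode reads them as 0xD834 and 0xDD1E, the surrogate pair for U+1D11E.
string right = Encoding.BigEndianUnicode.GetString(bigEndianBytes);

Console.WriteLine(wrong == "\u34D8\u1EDD"); // True
Console.WriteLine(right == "\U0001D11E");   // True
```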
