How do you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to validate a string against, but the char type in .NET is UTF-16, so some characters are represented as surrogate pairs. Thus, when enumerating the chars of a string I do not get 32-bit Unicode code points, and some comparisons with high values do not work.

I understand Unicode well enough that I could parse the bytes myself, but I was looking for a BCL C# / .NET Framework solution. So...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

+18
string c# char unicode astral-plane
Mar 26 '09 at 20:03
4 answers

This answer is incorrect. See @Virtlink's answer below for the correct one.

 // Requires System.Collections.Generic and System.Globalization (StringInfo).
 static int[] ExtractScalars(string s)
 {
     // Composite characters must be composed before text elements
     // can be mapped to single code points.
     if (!s.IsNormalized())
     {
         s = s.Normalize();
     }

     List<int> chars = new List<int>((s.Length * 3) / 2);

     var ee = StringInfo.GetTextElementEnumerator(s);
     while (ee.MoveNext())
     {
         // Take the first code point of each text element.
         string e = ee.GetTextElement();
         chars.Add(char.ConvertToUtf32(e, 0));
     }

     return chars.ToArray();
 }

Note: composite characters require normalization first (hence the IsNormalized check).
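
For illustration, a minimal usage sketch under the same assumptions as above. Note, as discussed in the next answer, that combining marks which do not compose into a single code point are silently dropped:

 // Sketch: "Ni\u006E\u0303o" normalizes to "Niño", so each text
 // element here maps to exactly one code point.
 int[] scalars = ExtractScalars("El Ni\u006E\u0303o");
 // { 0x45, 0x6c, 0x20, 0x4e, 0x69, 0xf1, 0x6f }
 //   note that 0x6e + 0x303 (n + combining tilde) composed to 0xf1 (ñ)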

+9
Mar 26 '09 at 20:28

You ask about code points. In UTF-16 (the C# char type), there are only two possibilities:

  • The character is in the Basic Multilingual Plane (BMP) and is encoded as a single code unit.
  • The character is outside the BMP and is encoded as a high/low surrogate pair (sketched below).
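
As a minimal sketch of the second case, a surrogate pair maps back to a code point with simple arithmetic; Char.ConvertToUtf32(char, char) does the equivalent:

 // Manual surrogate-pair decoding, equivalent to Char.ConvertToUtf32(high, low).
 // high is in [0xD800, 0xDBFF], low is in [0xDC00, 0xDFFF].
 static int DecodeSurrogatePair(char high, char low)
 {
     return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
 }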

Therefore, assuming the string is well-formed, this returns an array of code points for a given string:

 public static int[] ToCodePoints(string str)
 {
     if (str == null)
         throw new ArgumentNullException("str");

     var codePoints = new List<int>(str.Length);
     for (int i = 0; i < str.Length; i++)
     {
         // ConvertToUtf32 combines a surrogate pair into a single code point.
         codePoints.Add(Char.ConvertToUtf32(str, i));
         // Skip the low surrogate; ConvertToUtf32 already consumed it.
         if (Char.IsHighSurrogate(str[i]))
             i += 1;
     }

     return codePoints.ToArray();
 }

An example with a surrogate pair 🌀 and a combining character ñ:

 ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // πŸŒ€ El NiΓ±o // { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // πŸŒ€ E l N in Μƒβ—Œ o 

Here is another example. These two code points represent a thirty-second musical note with a staccato accent, both encoded as surrogate pairs:

 ToCodePoints("\U0001D162\U0001D181"); // 𝅒𝆁 // { 0x1d162, 0x1d181 } // 𝅒 π†β—Œ 

When NFC-normalized, they decompose into a notehead, a combining stem, a combining flag, and a combining accent-staccato, all of which are also surrogate pairs:

 ToCodePoints("\U0001D162\U0001D181".Normalize()); // π…˜π…₯𝅰𝆁 // { 0x1d158, 0x1d165, 0x1d170, 0x1d181 } // π…˜ π…₯ 𝅰 π†β—Œ 



Note that leppie's solution is incorrect. The question asks about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ◌̃. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
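
To make the distinction concrete, here is a rough sketch (assuming System.Globalization is in scope) showing that one text element can span several chars and code points; this grouping is what ExtractScalars above iterates over:

 var ee = StringInfo.GetTextElementEnumerator("Ni\u006E\u0303o");
 while (ee.MoveNext())
 {
     // Prints 1, 1, 2, 1: the "n + combining tilde" element spans two chars.
     Console.WriteLine(((string)ee.Current).Length);
 }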

+15
Jan 26 '15 at 17:12

It doesn't seem like it should be much more complicated than this:

 public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s ) { bool useBigEndian = !BitConverter.IsLittleEndian; Encoding utf32 = new UTF32Encoding( useBigEndian , false , true ) ; byte[] octets = utf32.GetBytes( s ) ; for ( int i = 0 ; i < octets.Length ; i+=4 ) { int codePoint = BitConverter.ToInt32(octets,i); yield return codePoint; } } 
+3
Jan 26 '15 at 18:11

I came up with the same approach proposed by Nicholas (and Jeppe), just shorter:

  public static IEnumerable<int> GetCodePoints(this string s) { var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true); var bytes = utf32.GetBytes(s); return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4)); } 

An enumeration was all I needed, but getting an array is trivial:

 int[] codePoints = myString.GetCodePoints().ToArray(); 
0
Jul 19 '16 at 14:10


