How do you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to validate a string against, but the char type in .NET is UTF-16, so some characters are represented as surrogate pairs. Thus, when enumerating the chars of a string I do not get 32-bit Unicode code points, and some comparisons with high values do not work.

I understand Unicode well enough that I could parse the bytes myself, but I was looking for a BCL C# / .NET Framework solution. So...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

+18
string c# char unicode astral-plane
Mar 26 '09 at 20:03
4 answers

This answer is incorrect. See @Virtlink's answer below for the correct one.

 // Requires System.Collections.Generic and System.Globalization (StringInfo).
 static int[] ExtractScalars(string s)
 {
     // Composite characters must be composed before text elements
     // can be mapped to single code points.
     if (!s.IsNormalized())
     {
         s = s.Normalize();
     }

     List<int> chars = new List<int>((s.Length * 3) / 2);

     var ee = StringInfo.GetTextElementEnumerator(s);
     while (ee.MoveNext())
     {
         // Take the first code point of each text element.
         string e = ee.GetTextElement();
         chars.Add(char.ConvertToUtf32(e, 0));
     }

     return chars.ToArray();
 }

Note: composite characters require normalization first (hence the IsNormalized check).
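
For illustration, a minimal usage sketch under the same assumptions as above. Note, as discussed in the next answer, that combining marks which do not compose into a single code point are silently dropped:

 // Sketch: "Ni\u006E\u0303o" normalizes to "Niño", so each text
 // element here maps to exactly one code point.
 int[] scalars = ExtractScalars("El Ni\u006E\u0303o");
 // { 0x45, 0x6c, 0x20, 0x4e, 0x69, 0xf1, 0x6f }
 //   note that 0x6e + 0x303 (n + combining tilde) composed to 0xf1 (ñ)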

+9
Mar 26 '09 at 20:28

You ask about code points. In UTF-16 (the C# char type), there are only two possibilities:

  • The character is in the Basic Multilingual Plane (BMP) and is encoded as a single code unit.
  • The character is outside the BMP and is encoded as a high/low surrogate pair (sketched below).
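
As a minimal sketch of the second case, a surrogate pair maps back to a code point with simple arithmetic; Char.ConvertToUtf32(char, char) does the equivalent:

 // Manual surrogate-pair decoding, equivalent to Char.ConvertToUtf32(high, low).
 // high is in [0xD800, 0xDBFF], low is in [0xDC00, 0xDFFF].
 static int DecodeSurrogatePair(char high, char low)
 {
     return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
 }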

Therefore, assuming the string is well-formed, this returns an array of code points for a given string:

 public static int[] ToCodePoints(string str)
 {
     if (str == null)
         throw new ArgumentNullException("str");

     var codePoints = new List<int>(str.Length);
     for (int i = 0; i < str.Length; i++)
     {
         // ConvertToUtf32 combines a surrogate pair into a single code point.
         codePoints.Add(Char.ConvertToUtf32(str, i));
         // Skip the low surrogate; ConvertToUtf32 already consumed it.
         if (Char.IsHighSurrogate(str[i]))
             i += 1;
     }

     return codePoints.ToArray();
 }

An example with a surrogate pair 🌀 and a combining character ñ:

 ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // πŸŒ€ El NiΓ±o // { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // πŸŒ€ E l N in Μƒβ—Œ o 

Here is another example. These two code points represent a thirty-second musical note with a staccato accent, both encoded as surrogate pairs:

 ToCodePoints("\U0001D162\U0001D181"); // 𝅒𝆁 // { 0x1d162, 0x1d181 } // 𝅒 π†β—Œ 

When NFC-normalized, they decompose into a notehead, a combining stem, a combining flag, and a combining accent-staccato, all of which are also surrogate pairs:

 ToCodePoints("\U0001D162\U0001D181".Normalize()); // π…˜π…₯𝅰𝆁 // { 0x1d158, 0x1d165, 0x1d170, 0x1d181 } // π…˜ π…₯ 𝅰 π†β—Œ 



Note that leppie's solution is incorrect. The question asks about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ◌̃. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
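
To make the distinction concrete, here is a rough sketch (assuming System.Globalization is in scope) showing that one text element can span several chars and code points; this grouping is what ExtractScalars above iterates over:

 var ee = StringInfo.GetTextElementEnumerator("Ni\u006E\u0303o");
 while (ee.MoveNext())
 {
     // Prints 1, 1, 2, 1: the "n + combining tilde" element spans two chars.
     Console.WriteLine(((string)ee.Current).Length);
 }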

+15
Jan 26 '15 at 17:12

It doesn't seem like it should be much more complicated than this:

 public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s ) { bool useBigEndian = !BitConverter.IsLittleEndian; Encoding utf32 = new UTF32Encoding( useBigEndian , false , true ) ; byte[] octets = utf32.GetBytes( s ) ; for ( int i = 0 ; i < octets.Length ; i+=4 ) { int codePoint = BitConverter.ToInt32(octets,i); yield return codePoint; } } 
+3
Jan 26 '15 at 18:11

I came up with the same approach proposed by Nicholas (and Jeppe), just shorter:

  public static IEnumerable<int> GetCodePoints(this string s) { var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true); var bytes = utf32.GetBytes(s); return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4)); } 

An enumeration was all I needed, but getting an array is trivial:

 int[] codePoints = myString.GetCodePoints().ToArray(); 
0
Jul 19 '16 at 14:10


