You are asking about code points. In UTF-16 (C#'s char), there are only two possibilities:
- The character is in the Basic Multilingual Plane (BMP) and is encoded as a single code unit.
- The character is outside the BMP and is encoded as a surrogate pair: a high surrogate followed by a low surrogate, two code units in total.
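To make the second case concrete, here is a minimal sketch of how a high/low surrogate pair maps back to a single code point. The class name SurrogateDemo and the helper Combine are hypothetical, introduced only for illustration; the built-in Char.ConvertToUtf32 does the same job and is used as a cross-check.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: combines a high/low surrogate pair by hand.
    public static int Combine(char high, char low)
    {
        if (!char.IsSurrogatePair(high, low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // Each surrogate carries 10 bits of the value above 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void Main()
    {
        string s = "\U0001F300"; // one code point, two UTF-16 code units

        Console.WriteLine(s.Length);                          // 2
        Console.WriteLine(Combine(s[0], s[1]).ToString("X")); // 1F300

        // Cross-check against the framework's own conversion.
        Console.WriteLine(char.ConvertToUtf32(s[0], s[1]) == Combine(s[0], s[1])); // True
    }
}
```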
Therefore, assuming the string is valid UTF-16, this returns an array of code points for the given string:
    // Requires: using System; using System.Collections.Generic;
    public static int[] ToCodePoints(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");

        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            codePoints.Add(Char.ConvertToUtf32(str, i));
            // A surrogate pair occupies two chars; skip the low surrogate.
            if (Char.IsHighSurrogate(str[i]))
                i += 1;
        }

        return codePoints.ToArray();
    }
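On .NET Core 3.0 or later, a similar result can probably be had without the manual surrogate check, using String.EnumerateRunes and System.Text.Rune. This is only a sketch under that runtime assumption; the names RuneDemo and ToCodePointsViaRunes are hypothetical. One behavioral difference to be aware of: an unpaired surrogate is reported as U+FFFD (the replacement character) here, whereas Char.ConvertToUtf32 throws.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RuneDemo
{
    // Hypothetical alternative built on string.EnumerateRunes().
    public static int[] ToCodePointsViaRunes(string str)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));

        var codePoints = new List<int>(str.Length);
        foreach (Rune rune in str.EnumerateRunes())
        {
            // Rune.Value is the Unicode scalar value as an int.
            codePoints.Add(rune.Value);
        }

        return codePoints.ToArray();
    }
}
```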
An example with a surrogate pair 🌀 and a composed character ñ:
    ToCodePoints("\U0001F300 El Ni\u006E\u0303o"); // 🌀 El Niño
    // { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f }
    // 🌀, space, E, l, space, N, i, n, combining tilde (U+0303), o
Here is another example. These two code points represent a thirty-second note with a staccato accent; both are encoded as surrogate pairs:
    ToCodePoints("\U0001D162\U0001D181");
    // { 0x1d162, 0x1d181 }
    // thirty-second note, combining accent-staccato
When normalized to Form C, the result is a notehead, a combining stem, a combining flag, and a combining accent-staccato, all encoded as surrogate pairs (the precomposed note is excluded from recomposition, so Form C leaves it decomposed):
    ToCodePoints("\U0001D162\U0001D181".Normalize());
    // { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }
    // notehead, combining stem, combining flag, combining accent-staccato
Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the string above, the ñ is represented by a Latin lowercase n followed by a combining tilde (U+0303). Leppie's solution discards any combining characters that cannot be normalized into a single code point.
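The difference between code points and text elements can be seen with System.Globalization.StringInfo. The following is a small sketch, not a replacement for either solution; the class name TextElementDemo is hypothetical. It uses only BMP characters, so the char count equals the code point count.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    public static void Main()
    {
        string s = "Ni\u006E\u0303o"; // "Niño" spelled with n + combining tilde

        // UTF-16 code units; equal to the number of code points here,
        // because every character is in the BMP.
        Console.WriteLine(s.Length); // 5

        // Text elements: n + U+0303 are grouped into one element.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 4

        // Walk the text elements and show how many chars each one spans.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            Console.WriteLine(e.GetTextElement().Length);
        }
    }
}
```

The third text element spans two chars (n and the combining tilde), which is exactly the information lost when only whole text elements are converted to code points.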