Enumerating a string by grapheme instead of by character

Strings are usually enumerated by character. But, especially when working with Unicode and non-English languages, I sometimes need to enumerate a string by grapheme. That is, combining marks and diacritics should be kept together with the base character they modify. What is the best way to do this in .NET?

Use case: Counting the individual phonetic sounds in an IPA string.

  • Simplified definition: There is a one-to-one relationship between grapheme and sound.
  • Realistic definition: Modifier letters should also be kept with the base character (for example, pʰ), and some sounds may be represented by two characters joined by a tie bar (k͡p).
2 answers

Simplified scenario

TextElementEnumerator is very useful and efficient:

    private static List<SoundCount> CountSounds(IEnumerable<string> words)
    {
        Dictionary<string, SoundCount> soundCounts = new Dictionary<string, SoundCount>();
        foreach (var word in words)
        {
            TextElementEnumerator graphemeEnumerator = StringInfo.GetTextElementEnumerator(word);
            while (graphemeEnumerator.MoveNext())
            {
                string grapheme = graphemeEnumerator.GetTextElement();
                SoundCount count;
                if (!soundCounts.TryGetValue(grapheme, out count))
                {
                    count = new SoundCount() { Sound = grapheme };
                    soundCounts.Add(grapheme, count);
                }
                count.Count++;
            }
        }
        return new List<SoundCount>(soundCounts.Values);
    }

You can also do this with a regular expression. (According to the documentation, TextElementEnumerator handles a few cases that the expression below does not, particularly supplementary characters, but those are fairly rare and in any case not needed for my application.)

    private static List<SoundCount> CountSoundsRegex(IEnumerable<string> words)
    {
        var soundCounts = new Dictionary<string, SoundCount>();
        var graphemeExpression = new Regex(@"\P{M}\p{M}*");
        foreach (var word in words)
        {
            Match graphemeMatch = graphemeExpression.Match(word);
            while (graphemeMatch.Success)
            {
                string grapheme = graphemeMatch.Value;
                SoundCount count;
                if (!soundCounts.TryGetValue(grapheme, out count))
                {
                    count = new SoundCount() { Sound = grapheme };
                    soundCounts.Add(grapheme, count);
                }
                count.Count++;
                graphemeMatch = graphemeMatch.NextMatch();
            }
        }
        return new List<SoundCount>(soundCounts.Values);
    }

Performance: In my testing, TextElementEnumerator was about 4 times faster than the regular expression.

Realistic scenario

Unfortunately, there is no way to "tweak" how TextElementEnumerator enumerates, so that class is of no use in the realistic scenario.

One solution is to customize our regex:

    [^\p{M}\p{Lm}]           # Match a character that is NOT a character intended to be combined with another character or a special character that is used like a letter
    (?:                      # Start a group for the combining characters:
      (?:                    #   Start a group for tied characters:
        [\u035C\u0361]       #     Match an under- or over- tie bar...
        \P{M}\p{M}*          #     ...followed by another grapheme (in the simplified sense)
      )                      #   (End the tied characters group)
      |\p{M}                 #   OR a character intended to be combined with another character
      |\p{Lm}                #   OR a special character that is used like a letter
    )*                       # Match the combining characters group zero or more times.
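For reference, here is a minimal sketch of how a commented, multi-line pattern like this might be used from code. It assumes RegexOptions.IgnorePatternWhitespace so that the layout and # comments are ignored, and it writes the leading class in negated form, [^\p{M}\p{Lm}], so that it excludes both categories. The class name IpaRegex and the method Split are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original answer.
public static class IpaRegex
{
    // IgnorePatternWhitespace lets the pattern span several lines with comments.
    private static readonly Regex Grapheme = new Regex(@"
        [^\p{M}\p{Lm}]                      # base: not a mark, not a modifier letter
        (?:
            (?: [\u035C\u0361] \P{M}\p{M}* )  # tie bar plus the grapheme it ties in
            | \p{M}                           # or a combining mark
            | \p{Lm}                          # or a modifier letter
        )*",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the matched text elements in order of appearance.
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        foreach (Match m in Grapheme.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

With this sketch, IpaRegex.Split("p\u02B0ak\u0361p") should yield three elements: pʰ, a, and k͡p.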

We could also write our own IEnumerator<string> using CharUnicodeInfo.GetUnicodeCategory to regain the performance, but that seems like too much work to me, and extra code to maintain. (Does anyone else want to have a go?) This is what regular expressions are made for.
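For anyone who does want to have a go, here is a rough sketch of what such a hand-written enumerator might look like. It is an untested-for-speed illustration, not a measured replacement: the class name IpaGraphemes and its helpers are hypothetical, it works on UTF-16 code units (so, like the regex, it does not handle supplementary characters), and it treats whatever character starts an element as the base.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch; names are illustrative only.
public static class IpaGraphemes
{
    // Combining marks: Unicode categories Mn, Mc and Me.
    private static bool IsMark(char c)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Modifier letters (Lm), such as the aspiration mark in "pʰ".
    private static bool IsModifierLetter(char c)
    {
        return CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.ModifierLetter;
    }

    // Under-tie (U+035C) and over-tie (U+0361) bars.
    private static bool IsTieBar(char c)
    {
        return c == '\u035C' || c == '\u0361';
    }

    public static IEnumerable<string> Enumerate(string text)
    {
        int i = 0;
        while (i < text.Length)
        {
            int start = i;
            i++; // consume the base character
            while (i < text.Length)
            {
                char c = text[i];
                if (IsTieBar(c) && i + 1 < text.Length && !IsMark(text[i + 1]))
                {
                    i += 2; // tie bar plus the character it ties in
                }
                else if (IsMark(c) || IsModifierLetter(c))
                {
                    i++; // mark or modifier letter stays with the base
                }
                else
                {
                    break; // next base character starts a new element
                }
            }
            yield return text.Substring(start, i - start);
        }
    }
}
```

Because the method uses yield return, the elements are produced lazily, so it can be dropped into a foreach loop in place of the regex matching in the counting code. Whether it actually recovers the performance would need to be measured.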


I'm not sure this is exactly what you are looking for, but isn't your question related to Unicode normalization?

When a string is normalized to Unicode Form C (which is the default form for String.Normalize), the diacritics and the characters they modify are combined wherever a precomposed character exists, so if you enumerate the characters, you get the base and modifier together as one character.

When it is normalized to Form D, the base and modifier characters are separated and are returned separately in the enumeration.

See String.Normalize for more details.
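A small sketch of the difference, using the String.Normalize(NormalizationForm) overload. The class and method names here are hypothetical and exist only for illustration.

```csharp
using System.Text;

// Hypothetical demo class; only String.Normalize and NormalizationForm are real API.
public static class NormalizationDemo
{
    // Number of UTF-16 code units after normalizing to the given form.
    public static int CodeUnitCount(string text, NormalizationForm form)
    {
        return text.Normalize(form).Length;
    }
}
```

For example, CodeUnitCount("e\u0301", NormalizationForm.FormC) gives 1, because e plus a combining acute accent composes to é, while CodeUnitCount("\u00E9", NormalizationForm.FormD) gives 2. Note that sequences with no precomposed form, such as pʰ ("p\u02B0") or a tie-barred pair, stay as separate characters under Form C, so normalization alone may not cover the realistic scenario.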

