Simplified scenario
TextElementEnumerator is very useful and efficient:
    private static List<SoundCount> CountSounds(IEnumerable<string> words)
    {
        Dictionary<string, SoundCount> soundCounts = new Dictionary<string, SoundCount>();

        foreach (var word in words)
        {
            // Walk the word one text element (grapheme) at a time.
            TextElementEnumerator graphemeEnumerator = StringInfo.GetTextElementEnumerator(word);
            while (graphemeEnumerator.MoveNext())
            {
                string grapheme = graphemeEnumerator.GetTextElement();

                // Create the counter the first time a grapheme is seen.
                SoundCount count;
                if (!soundCounts.TryGetValue(grapheme, out count))
                {
                    count = new SoundCount() { Sound = grapheme };
                    soundCounts.Add(grapheme, count);
                }
                count.Count++;
            }
        }

        return new List<SoundCount>(soundCounts.Values);
    }
You can also do this with a regular expression. (According to the documentation, TextElementEnumerator handles a few cases that the expression below does not, notably supplementary characters, but those are quite rare and in any case not needed for my application.)
    private static List<SoundCount> CountSoundsRegex(IEnumerable<string> words)
    {
        var soundCounts = new Dictionary<string, SoundCount>();

        // One non-mark character followed by any number of combining marks.
        var graphemeExpression = new Regex(@"\P{M}\p{M}*");

        foreach (var word in words)
        {
            Match graphemeMatch = graphemeExpression.Match(word);
            while (graphemeMatch.Success)
            {
                string grapheme = graphemeMatch.Value;

                // Create the counter the first time a grapheme is seen.
                SoundCount count;
                if (!soundCounts.TryGetValue(grapheme, out count))
                {
                    count = new SoundCount() { Sound = grapheme };
                    soundCounts.Add(grapheme, count);
                }
                count.Count++;

                graphemeMatch = graphemeMatch.NextMatch();
            }
        }

        return new List<SoundCount>(soundCounts.Values);
    }
Performance: In my testing, I found that TextElementEnumerator was about 4 times faster than a regular expression.
Realistic scenario
Unfortunately, there is no way to "tweak" how TextElementEnumerator enumerates, so that class is of no use in the realistic scenario.
One solution is to customize our regex:
[\P{M}\P{Lm}]
We could also write our own IEnumerator<string> using CharUnicodeInfo.GetUnicodeCategory to get our performance back, but that seems like too much work to me, and it is extra code to maintain. (Does anyone else want to have a go?) This is what regular expressions are made for.
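For anyone who does want to have a go, here is a rough sketch of what the hand-rolled approach might look like. It is not from my tested code and I have not benchmarked it. The class and method names (GraphemeSplitter, EnumerateGraphemes, AttachesToPrevious) are made up for illustration, and it assumes that every combining mark (M) and modifier letter (Lm) simply attaches to the preceding base character. It uses an iterator method returning IEnumerable<string> rather than implementing IEnumerator<string> by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class GraphemeSplitter
{
    // Assumption: marks (Mn, Mc, Me) and modifier letters (Lm)
    // belong to the base character that precedes them.
    private static bool AttachesToPrevious(UnicodeCategory category)
    {
        switch (category)
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
            case UnicodeCategory.ModifierLetter:
                return true;
            default:
                return false;
        }
    }

    public static IEnumerable<string> EnumerateGraphemes(string word)
    {
        var current = new StringBuilder();
        for (int i = 0; i < word.Length; i++)
        {
            // Keep surrogate pairs together as one code point.
            int length = char.IsSurrogatePair(word, i) ? 2 : 1;
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(word, i);

            // A new base character closes the grapheme collected so far.
            if (current.Length > 0 && !AttachesToPrevious(category))
            {
                yield return current.ToString();
                current.Clear();
            }

            current.Append(word, i, length);
            i += length - 1;
        }

        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }
}
```

With this, EnumerateGraphemes("t\u02B0a") yields "tʰ" and then "a", so it could replace the Match loop in CountSoundsRegex. It does nothing special for tie bars or any other rule that joins two base characters, so those would still need extra handling.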
Dave mateer