UTF-16 safe substring in C# .NET

I want to take a substring of a given length, say 150. However, I want to make sure I don't cut the string in the middle of a Unicode character.

For example, see the following code:

    var str = "Hello😀 world!";
    var substr = str.Substring(0, 6);

Here substr is an invalid string, because the emoticon character is cut in half.
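
To see why the result is invalid: 😀 (U+1F600) lies outside the Basic Multilingual Plane, so UTF-16 stores it as two char values, a high surrogate followed by a low surrogate. Substring(0, 6) keeps only the first of the two. A minimal sketch of a check for that condition follows; the SurrogateDemo class and its method name are made up for illustration and are not part of .NET.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: returns true when the string ends with a high
    // surrogate whose low-surrogate partner has been cut off.
    public static bool EndsWithLoneHighSurrogate(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        return s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]);
    }
}
```

Under these assumptions, SurrogateDemo.EndsWithLoneHighSurrogate(str.Substring(0, 6)) is true for the string above, while the same check on str.Substring(0, 7) is false because the pair is complete.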

Instead, I want a function that does the following:

    var str = "Hello😀 world!";
    var substr = str.UnicodeSafeSubstring(0, 6);

where substr contains "Hello😀"

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange:

    NSString* str = @"Hello😀 world!";
    NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
    NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?

2 answers

This should return the longest substring that starts at startIndex and contains at most length UTF-16 code units' worth of complete graphemes. Surrogate pairs split at the start or end of the range are dropped, combining marks at the start are dropped, and trailing characters that would lose their combining marks are dropped.

Note that this is probably not what you asked for. You seem to want graphemes as the unit of measure (or perhaps you want to include the last grapheme even if it runs past the length parameter).
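
The difference between the two units of measure can be made concrete with a small sketch. The LengthDemo class and its method names are hypothetical; string.Length and StringInfo.LengthInTextElements are the real .NET members being compared.

```csharp
using System.Globalization;

public static class LengthDemo
{
    // Length in UTF-16 code units (char values), which is what Substring counts.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Length in text elements (graphemes), as reported by StringInfo.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "Hello😀 world!" the first method gives 14 and the second gives 13, because the emoji occupies two code units but is one text element. Likewise "e\u0301" (e followed by a combining acute accent) is two code units but a single text element.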

    using System;
    using System.Globalization;
    using System.Text;

    public static class StringEx
    {
        public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
        {
            if (str == null)
                throw new ArgumentNullException("str");
            if (startIndex < 0 || startIndex > str.Length)
                throw new ArgumentOutOfRangeException("startIndex");
            if (length < 0 || startIndex + length > str.Length)
                throw new ArgumentOutOfRangeException("length");
            if (length == 0)
                return string.Empty;

            var sb = new StringBuilder(length);
            int pos = startIndex;           // current position, in UTF-16 code units
            int end = startIndex + length;  // exclusive end of the requested range

            var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);
            while (enumerator.MoveNext())
            {
                string grapheme = enumerator.GetTextElement();
                pos += grapheme.Length;

                // Stop before a grapheme that would run past the requested range
                if (pos > end)
                    break;

                // Skip initial low surrogates/combining marks
                if (sb.Length == 0)
                {
                    if (char.IsLowSurrogate(grapheme[0]))
                        continue;

                    UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);
                    if (cat == UnicodeCategory.NonSpacingMark ||
                        cat == UnicodeCategory.SpacingCombiningMark ||
                        cat == UnicodeCategory.EnclosingMark)
                        continue;
                }

                sb.Append(grapheme);

                if (pos == end)
                    break;
            }

            return sb.ToString();
        }
    }

A variant that simply includes the extra characters at the end of the substring, if necessary, to complete the final grapheme:

    using System;
    using System.Globalization;
    using System.Text;

    public static class StringEx
    {
        public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
        {
            if (str == null)
                throw new ArgumentNullException("str");
            if (startIndex < 0 || startIndex > str.Length)
                throw new ArgumentOutOfRangeException("startIndex");
            if (length < 0 || startIndex + length > str.Length)
                throw new ArgumentOutOfRangeException("length");
            if (length == 0)
                return string.Empty;

            var sb = new StringBuilder(length);
            int pos = startIndex;           // current position, in UTF-16 code units
            int end = startIndex + length;  // exclusive end of the requested range

            var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);
            while (enumerator.MoveNext())
            {
                // Stop once the requested range is covered; the last grapheme
                // appended may extend past it
                if (pos >= end)
                    break;

                string grapheme = enumerator.GetTextElement();
                pos += grapheme.Length;

                // Skip initial low surrogates/combining marks
                if (sb.Length == 0)
                {
                    if (char.IsLowSurrogate(grapheme[0]))
                        continue;

                    UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);
                    if (cat == UnicodeCategory.NonSpacingMark ||
                        cat == UnicodeCategory.SpacingCombiningMark ||
                        cat == UnicodeCategory.EnclosingMark)
                        continue;
                }

                sb.Append(grapheme);
            }

            return sb.ToString();
        }
    }

This returns what you asked for: "Hello😀 world!".UnicodeSafeSubstring(0, 6) == "Hello😀".


It looks like you want to split the string into graphemes, that is, into individual displayed characters.

In that case there is a convenient method, StringInfo.SubstringByTextElements:

    var str = "Hello😀 world!";
    var substr = new StringInfo(str).SubstringByTextElements(0, 6);
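
Two things are worth keeping in mind with this approach. The arguments count text elements, not char values, and SubstringByTextElements throws ArgumentOutOfRangeException when the requested range goes past the end of the string. For the "at most 150" use case from the question, a clamping wrapper might look like the sketch below. TextElementEx and TruncateByTextElements are hypothetical names chosen for this example.

```csharp
using System;
using System.Globalization;

public static class TextElementEx
{
    // Hypothetical helper: returns at most maxElements text elements from the
    // start of s, never splitting a surrogate pair or combining sequence.
    public static string TruncateByTextElements(string s, int maxElements)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (maxElements < 0) throw new ArgumentOutOfRangeException(nameof(maxElements));

        var info = new StringInfo(s);

        // Clamp to the number of text elements actually available.
        int count = Math.Min(maxElements, info.LengthInTextElements);

        // Return early for zero elements so SubstringByTextElements is never
        // called on an empty string, where it would throw.
        return count == 0 ? string.Empty : info.SubstringByTextElements(0, count);
    }
}
```

With this sketch, TextElementEx.TruncateByTextElements("Hello😀 world!", 6) gives "Hello😀", and a string shorter than the limit is returned unchanged.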