What does .NET String.Normalize do?

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

It also sometimes refers to "Unicode normalization form C."

I'm just wondering, what does that mean? How is this function useful in real-life situations?

+53
string
Jul 20 '10 at 8:17
4 answers

It makes sure that Unicode strings can be compared for equality, even if they represent the same text with different sequences of code points.

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
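As a minimal sketch of that idea (the class and method names here are made up for illustration, not part of .NET), the same visible "é" can be written as one code point or as "e" plus a combining accent. The two strings only compare equal once both are brought to the same normalization form:

```csharp
using System;
using System.Text;

static class NormalizeEqualityDemo
{
    // Hypothetical helper: normalizes both strings to form C, then compares
    // them ordinally (code unit by code unit).
    public static bool EqualAfterNormalization(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        string composed = "\u00E9";     // é as a single code point
        string decomposed = "e\u0301";  // e followed by a combining acute accent

        // The == operator on strings is an ordinal comparison.
        Console.WriteLine(composed == decomposed);                         // False
        Console.WriteLine(EqualAfterNormalization(composed, decomposed));  // True
    }
}
```

Form C is used here only because it is the default of the parameterless Normalize(); any of the forms would work as long as both sides use the same one.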

+37
Jul 20 '10 at 8:22

One difference between form C and form D is how letters with accents are represented: form C uses a single code point for a letter with an accent, while form D splits it into a letter and an accent.

A side effect is that this makes it easy to write a "remove accents" method.

    using System.Globalization;
    using System.Linq;

    public static string RemoveAccents(string input)
    {
        // Normalizing to FormD splits accented letters into letter + accent(s);
        // the filter then drops those accents (and other non-spacing marks).
        return new string(
            input
                .Normalize(System.Text.NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                .ToArray());
    }

+44
Jul 20 '10 at 8:25

In Unicode, a (composed) character can have either a unique code point or a sequence of code points consisting of the base character and its accents.

Wikipedia gives as an example the Vietnamese ế (U+1EBF) and its decomposed sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent).

string.Normalize() converts between the 4 normal forms in which a string can be encoded in Unicode.
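A small sketch of the forms applied to the example above (the helper name CodeUnits is made up for illustration). It also includes the "fi" ligature U+FB01 to show where the compatibility forms (KC/KD) differ from the canonical ones (C/D). All characters used are in the Basic Multilingual Plane, so each UTF-16 code unit printed is one code point:

```csharp
using System;
using System.Linq;
using System.Text;

static class NormalFormsDemo
{
    // Hypothetical helper: normalizes s to the given form and lists the
    // resulting UTF-16 code units as "U+XXXX" values.
    public static string CodeUnits(string s, NormalizationForm form)
    {
        return string.Join(" ", s.Normalize(form).Select(c => $"U+{(int)c:X4}"));
    }

    static void Main()
    {
        string e = "\u1EBF";        // ế, precomposed
        string ligature = "\uFB01"; // fi ligature

        Console.WriteLine(CodeUnits(e, NormalizationForm.FormC));         // U+1EBF
        Console.WriteLine(CodeUnits(e, NormalizationForm.FormD));         // U+0065 U+0302 U+0301
        Console.WriteLine(CodeUnits(ligature, NormalizationForm.FormC));  // U+FB01
        Console.WriteLine(CodeUnits(ligature, NormalizationForm.FormKC)); // U+0066 U+0069
    }
}
```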

+6
Jul 20 '10 at 8:33

This link has a good explanation:

http://unicode.org/reports/tr15/#Norm_Forms

From what I can gather, it is there so that you can compare two Unicode strings for equality.

+5
Jul 20 '10 at 8:22