Efficient way to check if a Unicode string is NFC in Python?

I want to check if a string is already in NFC form. I am currently doing:

unicodedata.normalize('NFC', s) == s

I do this for a large number of lines, so I want it to be efficient. The approach above seems wasteful: it converts the string to NFC and then performs a string comparison.

Is there a more efficient way to do this? I thought of:

len(unicodedata.normalize('NFC', s)) == len(s)

This avoids the string comparison, but I'm not sure it is always correct. It only works if NFC normalization always changes the length of a string that is not already in NFC. Is this a valid assumption?

Any other ideas?

+4
1 answer

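Your length check is not reliable. For example, 'Ω' (U+2126, OHM SIGN) is NFC-normalized to 'Ω' (U+03A9, GREEK CAPITAL LETTER OMEGA), so the string changes while its length stays the same.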
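The counterexample can be reproduced with the standard library alone. This is a minimal sketch; the variable names are only illustrative:

```python
import unicodedata

ohm = "\u2126"    # OHM SIGN, not in NFC
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA, the NFC form of OHM SIGN

normalized = unicodedata.normalize("NFC", ohm)

# The length is unchanged, so the length check wrongly reports "already NFC"...
length_check = len(normalized) == len(ohm)
# ...while the equality check correctly reports that the string changed.
equality_check = normalized == ohm

print(length_check, equality_check)  # prints: True False
```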

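Unicode defines a "quick check" property for this purpose, but, as far as I know, Python's unicodedata module did not expose it before Python 3.8. However, unicodedata.normalize() in CPython uses the quick check internally, so in the common case where the string is already normalized it is fast and returns the string unchanged.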
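A minimal sketch of a check that uses unicodedata.is_normalized() where it is available (Python 3.8 and later) and otherwise falls back to the comparison from the question. The helper name is_nfc and the sample strings are hypothetical, chosen only for illustration:

```python
import unicodedata

def is_nfc(s):
    """Return True if s is already in NFC form."""
    if hasattr(unicodedata, "is_normalized"):
        # Python 3.8+: checks the form without building a normalized copy
        # when the quick check can decide.
        return unicodedata.is_normalized("NFC", s)
    # Older interpreters: normalize and compare, as in the question.
    return unicodedata.normalize("NFC", s) == s

# Example: collect the lines that are not in NFC.
lines = ["caf\u00e9", "cafe\u0301", "\u2126"]
not_nfc = [line for line in lines if not is_nfc(line)]
```

Here the first string uses the precomposed é and is already NFC; the other two (e followed by a combining acute accent, and OHM SIGN) are not, so they end up in not_nfc.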

If that is not fast enough, you could look for another Unicode library for Python (for example, PyICU).

+5