tl;dr: use the \X regex from the regex module to match user-perceived characters:

>>> import regex
>>> regex.findall(r'\X', 'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
While I don't know any Thai, I do know a little French.
Consider the letter è. Let s and s2 both be equal to è in the Python shell:
>>> s
'è'
>>> s2
'è'
The same letter? To a French speaker, oui. Not to the computer:
>>> s == s2
False
You can create the same letter either by using the actual code point for è, or by taking the letter e and appending a combining code point that adds the accent. The two forms have different encodings:
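A minimal sketch of how the two forms can be constructed explicitly; the escape sequences are assumptions inferred from the encodings shown below:

```python
# Precomposed form: a single code point.
s = '\u00e8'    # U+00E8 LATIN SMALL LETTER E WITH GRAVE

# Decomposed form: base letter plus a combining mark.
s2 = 'e\u0300'  # 'e' followed by U+0300 COMBINING GRAVE ACCENT

print(s, s2)    # both render as è
print(s == s2)  # False: different code point sequences
```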
>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'
And they have different lengths:
>>> len(s)
1
>>> len(s2)
2
But visually, both encodings produce the same "letter" è. This is called a grapheme: what the end user perceives as a single character.
Looping over the strings shows the same behavior:
>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']
Your string contains several such combining characters. That is why a Thai string with 9 graphemes to your eyes becomes a 13-code-point string in Python.
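You can inspect this with the standard-library unicodedata module. For this particular string, subtracting the combining marks from the code-point count happens to yield the grapheme count, though that shortcut does not hold for all scripts:

```python
import unicodedata

text = 'เมื่อแรกเริ่ม'  # the Thai string from the question
print(len(text))  # 13 code points

# Code points in the mark categories (Mn, Mc, Me) attach to the
# preceding base character instead of standing on their own:
marks = [c for c in text if unicodedata.category(c).startswith('M')]
print(len(marks))              # 4 combining marks
print(len(text) - len(marks))  # 9 base characters, matching the 9 graphemes
```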
For the French example, the solution is to normalize the string based on Unicode equivalence:
>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True
That does not work for many non-Latin languages, though. An easier way to handle Unicode strings in which a single grapheme may consist of multiple code points is a regex engine that handles this correctly by supporting \X, the grapheme-cluster pattern. Unfortunately, Python's built-in re module does not support it yet.
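To illustrate why normalization is not enough here (a sketch; it relies on the fact that Thai combining marks have no precomposed equivalents in Unicode, so NFC has nothing to merge them into):

```python
from unicodedata import normalize

text = 'เมื่อแรกเริ่ม'
# NFC leaves the Thai string unchanged: there are no precomposed
# forms for its mark sequences to compose into.
print(normalize('NFC', text) == text)  # True
print(len(normalize('NFC', text)))     # still 13 code points, not 9
```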
The proposed replacement, the third-party regex module, does support \X:
>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
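For contrast, the standard-library re module rejects the \X escape outright (a sketch; recent CPython versions raise re.error for unknown letter escapes in a pattern):

```python
import re

try:
    re.compile(r'\X')
except re.error as exc:
    # e.g. "bad escape \X" on recent Python versions
    print('re does not support \\X:', exc)
```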