Separate Thai text by characters

Not by word boundaries; that part is solvable.

Example:

    #!/usr/bin/env python3
    text = 'เมื่อแรกเริ่ม'
    for char in text:
        print(char)

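It produces one code point per line, with the combining marks split from their base consonants:

    เ
    ม
    ื
    ่
    อ
    แ
    ร
    ก
    เ
    ร
    ิ
    ่
    ม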

This is obviously not the desired result. Any ideas?

A portable representation of the text:

 text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21' 
3 answers

tl;dr: use the \X regular expression to extract user-perceived characters:

    >>> import regex  # $ pip install regex
    >>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
    ['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']

While I do not know Thai, I know a little French.

Consider the letter è. Let s and s2 both display as è in the Python shell:

    >>> s
    'è'
    >>> s2
    'è'

The same letter? To a French speaker, oui. To the computer, no:

    >>> s == s2
    False

You can create the same letter either by using the actual code point for è, or by taking the letter e and appending a combining code point that adds the accent. The two forms have different encodings:

    >>> s.encode('utf-8')
    b'\xc3\xa8'
    >>> s2.encode('utf-8')
    b'e\xcc\x80'
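The session above does not show how s and s2 were built. A minimal sketch of one construction that matches the encodings shown, assuming the precomposed U+00E8 for s and "e" followed by U+0300 for s2:

```python
import unicodedata

# Precomposed form: a single code point, U+00E8 (assumed construction of s).
s = '\u00e8'
# Decomposed form: "e" followed by U+0300 (assumed construction of s2).
s2 = 'e\u0300'

# List the length and the Unicode names of the code points in each string.
for label, value in (('s', s), ('s2', s2)):
    names = [unicodedata.name(ch) for ch in value]
    print(label, len(value), names)
```

Printing the Unicode names makes the difference visible even when both strings render identically.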

And they have different lengths:

    >>> len(s)
    1
    >>> len(s2)
    2

But visually, both forms render as the same "letter" è. This is called a grapheme: what the end user considers a single character.

You can reproduce the same looping behavior you are seeing:

    >>> [c for c in s]
    ['è']
    >>> [c for c in s2]
    ['e', '̀']

There are several combining characters in your string. Therefore, a Thai string that looks like 9 graphemes to your eyes is a string of 13 code points in Python.
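As a rough check of those numbers, the sketch below counts the nonspacing marks in the string from the question. Subtracting marks from code points only approximates the grapheme count; it happens to work for this string, and the variable names are illustrative.

```python
import unicodedata

# The string from the question, written with escapes for portability.
text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')

# Nonspacing marks (category "Mn") attach to the preceding base character.
marks = [ch for ch in text if unicodedata.category(ch) == 'Mn']

code_points = len(text)
bases = code_points - len(marks)
print(code_points, len(marks), bases)
```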

The solution for French is to normalize the string based on Unicode equivalence:

    >>> from unicodedata import normalize
    >>> normalize('NFC', s2) == s
    True
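To see how far normalization goes, here is a minimal sketch that applies NFC both to the decomposed è and to the Thai string from the question. It assumes s2 is "e" followed by U+0300, which matches the encoding shown earlier.

```python
from unicodedata import normalize

s2 = 'e\u0300'  # assumed: "e" + COMBINING GRAVE ACCENT
text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')

composed = normalize('NFC', s2)
thai_nfc = normalize('NFC', text)

# NFC merges "e" + accent into one code point ...
print(len(s2), len(composed))
# ... but there are no precomposed forms for these Thai sequences to merge into.
print(len(text), len(thai_nfc))
```

The French pair collapses to a single code point, while the Thai string keeps all of its code points.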

This does not work for many non-Latin scripts, because NFC can only merge sequences that have a precomposed code point. An easy way to handle Unicode strings in which several code points make up a single grapheme is to use a regex engine that handles this correctly by supporting \X. Unfortunately, Python's built-in re module does not support it yet.

The proposed replacement, the regex module, does support \X, though:

    >>> import regex
    >>> text = 'เมื่อแรกเริ่ม'
    >>> regex.findall(r'\X', text)
    ['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
    >>> len(_)
    9

I cannot exactly reproduce it, but here is a slightly modified version of your script, with its output under IDLE 3.4 on 64-bit Windows 7:

    >>> import unicodedata
    >>> for char in text:
    ...     print(char, hex(ord(char)), unicodedata.name(char), '-',
    ...           unicodedata.category(char), '-', unicodedata.combining(char), '-',
    ...           unicodedata.east_asian_width(char))
    ...
    เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
    ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
    ื 0xe37 THAI CHARACTER SARA UEE - Mn - 0 - N
    ่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
    อ 0xe2d THAI CHARACTER O ANG - Lo - 0 - N
    แ 0xe41 THAI CHARACTER SARA AE - Lo - 0 - N
    ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
    ก 0xe01 THAI CHARACTER KO KAI - Lo - 0 - N
    เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
    ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
    ิ 0xe34 THAI CHARACTER SARA I - Mn - 0 - N
    ่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
    ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
    >>>

I really do not know what these characters mean (my Thai is very poor :-)), but the output shows that:

  • the text is recognized as Thai
  • the output is consistent with len(text) (13)
  • the category and combining values differ for the combining characters

If this is the expected output, it shows that your problem is not in Python but rather in the console where you display it. You should try redirecting the output to a file and then opening the file in a Unicode-aware editor that supports Thai characters.
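One way to take the console out of the picture is to write the per-character dump straight to a UTF-8 file. This is only a sketch: the file name thai_chars.txt and its location in the temporary directory are hypothetical choices.

```python
import os
import tempfile
import unicodedata

text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), 'thai_chars.txt')

# An explicit encoding avoids depending on the console or locale code page.
with open(path, 'w', encoding='utf-8') as out:
    for char in text:
        out.write('{}\t{}\t{}\n'.format(
            char, hex(ord(char)), unicodedata.name(char)))

print('wrote', path)
```

Opening the resulting file in an editor with Thai font support shows whether the data itself is intact.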

If the expected output is only 9 characters, that is, if you do not want to decompose the composed characters, and provided that there are no other composition rules to consider, you can use something like:

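    import unicodedata

    def Thaidump(t):
        old = None  # cluster collected so far
        for i in t:
            if unicodedata.category(i) == 'Mn':
                # a nonspacing mark joins the current cluster
                if old is not None:
                    old = old + i
            else:
                # a new base character: print the finished cluster first
                if old is not None:
                    print(old)
                old = i
        print(old)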

Thus:

    >>> Thaidump(text)
    เ
    มื่
    อ
    แ
    ร
    ก
    เ
    ริ่
    ม
    >>>
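If the clusters are needed as data rather than printed output, a variant of the same idea can return a list. This is a sketch under the same assumption as above (only category "Mn" marks are attached to the preceding character); the name thai_clusters is illustrative, and it does not implement full Unicode grapheme segmentation.

```python
import unicodedata

def thai_clusters(t):
    """Group each character with the nonspacing marks that follow it."""
    clusters = []
    for ch in t:
        if unicodedata.category(ch) == 'Mn' and clusters:
            # Attach the mark to the cluster being built.
            clusters[-1] += ch
        else:
            # A base character (or a leading mark) starts a new cluster.
            clusters.append(ch)
    return clusters

text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')
print(thai_clusters(text))
```

Unlike the printing version, a mark at the very start of the string is kept as its own cluster instead of being dropped, and an empty string yields an empty list.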

To clarify the previous answers, the problem is that the missing characters are "combining characters": vowels and diacritics that must be combined with other characters to display properly. There is no standard way to display these characters on their own, although the most common convention is to use the dotted circle as a null consonant, as shown in Serge Ballesta's answer.
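The dotted-circle convention can be sketched in a few lines: when a combining mark has to be shown on its own, U+25CC DOTTED CIRCLE is placed before it as a stand-in base. The helper name displayable is hypothetical, and the check covers only category "Mn" marks.

```python
import unicodedata

DOTTED_CIRCLE = '\u25cc'

def displayable(ch):
    """Return ch, prefixed with a dotted circle if it is a nonspacing mark."""
    if unicodedata.category(ch) == 'Mn':
        return DOTTED_CIRCLE + ch
    return ch

text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')
for ch in text:
    print(displayable(ch))
```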

The question is whether, for your application, each vowel and diacritical mark should be treated as a separate character, or whether you want to split the text by "print cell", as shown in Serge's answer.

By the way, in normal use, the leading vowels SARA E and SARA AE should not be displayed without a following consonant, except while a longer word is being typed.

For more information, see the WTT 2.0 standard published by the Thai API Consortium (TAPIC), which defines how characters can be combined, how they are displayed, and how errors are handled.
