Why are these Thai characters displayed with a long tail on a web page?

Question

Why are these Thai characters displayed with a long tail on a web page?

ด ้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้ d ด ็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้

I found some interesting characters, like the ones I pasted above, which take up only 3 character positions on screen. However, the actual string length is 380.

I checked the string in Python, and the encoded string looks like this:

'\ xe0 \ XB8 \ x94 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xd0 \ XB4 \ xe0 \ XB8 \ x94 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ X b9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 
\ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 '

The string seems to be a combination of three Thai characters:

ด \xe0\xb8\x94 THAI CHARACTER DO DEK
้ \xe0\xb9\x89 THAI CHARACTER MAI THO
็ \xe0\xb9\x87 THAI CHARACTER MAITAIKHU
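For reference, a minimal Python 3 sketch of how such a lookup might be reproduced. Only the three byte sequences come from the string above; the variable names are illustrative, and `unicodedata` is the standard-library module.

```python
import unicodedata

# The three distinct UTF-8 byte sequences found in the string, concatenated.
encoded = b"\xe0\xb8\x94\xe0\xb9\x89\xe0\xb9\x87"

# Decode the bytes into code points, then look up each official Unicode name.
decoded = encoded.decode("utf-8")
names = [unicodedata.name(ch) for ch in decoded]

for ch, name in zip(decoded, names):
    # Print the code point, its UTF-8 bytes, and its name.
    print(f"U+{ord(ch):04X} {ch.encode('utf-8')!r} {name}")
```

Each of the three characters takes 3 bytes in UTF-8, so byte counts run well ahead of the number of characters.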

My questions are:

Why do these characters behave this way? Is it a bug?
How can I prevent this on my site (possibly with some HTML filter)?

UPDATE

I tested the characters in a number of browsers, and the long tail appeared only in Chrome and Firefox on Windows.

Below are the screenshots:

win 7 ie8

ubuntu firefox

win 7 chrome

win 7 firefox

Therefore, I think this is a browser-related bug.

+25

unicode

xiao 啸 Aug 19 2018-11-11T00:


4 answers

koan · Answer 1 · 2011-08-19 10:19

There are two problems: one in the output system (the font renderer), which is not Thai-aware, and one in the input system, which generated this text in the first place.

If you had done your homework, you would know that MAI THO and MAITAIKHU (their Unicode names) are what Unicode calls nonspacing marks (NSM). This means the font renderer should not advance to the next character cell when such a character is displayed.
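As a rough illustration, the nonspacing property can be checked from Python's standard `unicodedata` module, where the general category `Mn` stands for "Mark, Nonspacing". The helper name `is_nonspacing_mark` is made up for this sketch.

```python
import unicodedata

DO_DEK, MAI_THO, MAITAIKHU = "\u0e14", "\u0e49", "\u0e47"

def is_nonspacing_mark(ch: str) -> bool:
    """Return True if ch is in Unicode general category Mn (Mark, Nonspacing)."""
    return unicodedata.category(ch) == "Mn"

for ch in (DO_DEK, MAI_THO, MAITAIKHU):
    # DO DEK is an ordinary letter; the other two are nonspacing marks.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_nonspacing_mark(ch))
```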

To avoid the mess you see above, the Thai API Consortium (TAPIC) created the WTT 2.0 standard, which describes how a font rendering algorithm should handle the order of the Thai letters it receives as input, and also how an input method should handle these characters when a user tries to enter them.

Overview of the standardization and implementation of the Thai language

libthai includes both input and output methods.

thaicheck is a small program that can detect letter sequence problems and fix them.

By the way, the sequence DO DEK, MAI THO, MAITAIKHU cannot occur as a word; the input sequence is noise.

Keep in mind that some editors have broken input methods that let you type multiple NSMs that cannot be combined, while the output method displays only the legal sequences; the result is an invalid input string that looks fine to the user on their own system.
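On the filtering side, here is a minimal Python sketch of one possible server-side cleanup. It does not implement WTT 2.0 sequence checking; it simply caps how many combining marks may follow each other. The function name `limit_combining_marks` and the default cap of 2 are assumptions for illustration, and a fixed cap may be too strict for scripts that legitimately stack more marks.

```python
import unicodedata

def limit_combining_marks(text: str, max_marks: int = 2) -> str:
    """Drop nonspacing/enclosing marks beyond max_marks in a row."""
    out = []
    run = 0  # length of the current run of consecutive marks
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0  # a non-mark character starts a fresh run
        out.append(ch)
    return "".join(out)

# DO DEK followed by five MAI THO and five MAITAIKHU, as in the noisy input.
noisy = "\u0e14" + "\u0e49" * 5 + "\u0e47" * 5
cleaned = limit_combining_marks(noisy)
print(len(noisy), len(cleaned))  # prints: 11 3
```

A stricter filter could instead reject any sequence that a Thai-aware checker such as thaicheck flags as illegal.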

devio · Answer 2 · 2011-08-19 10:18

The codes you mentioned are all UTF-8 encodings, so each character takes 3 bytes. The respective Unicode code points:

DO DEK 0x0e14
MAI THO 0x0e49
MAITAIKHU 0x0e47

The last two are in the general category Mark, Nonspacing (Mn). MAI THO has a Canonical_Combining_Class of 107, while MAITAIKHU has class 0. As nonspacing marks, both are combined with the preceding base character when rendered.

Your example starts with a single base character and attaches a long run of nonspacing marks to it.

Compare with this C# code:

    char DODEK = (char)0x0e14;
    char MAITHO = (char)0x0e49;
    char MAITAIKHU = (char)0x0e47;
    string thai = new string(new char[] { DODEK, MAITHO, MAITAIKHU });
    Console.WriteLine("number of code points: " + thai.Length);
    var si = new System.Globalization.StringInfo(thai);
    Console.WriteLine("number of text elements: " + si.LengthInTextElements);

Output:

    number of code points: 3
    number of text elements: 1
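A rough Python counterpart, as a sketch only: the standard library has no grapheme segmentation, so the hypothetical helper `count_base_clusters` approximates text elements by counting every character that is not a nonspacing or enclosing mark. That is enough for this string but is not full Unicode text segmentation.

```python
import unicodedata

thai = "\u0e14\u0e49\u0e47"  # DO DEK, MAI THO, MAITAIKHU

def count_base_clusters(text: str) -> int:
    """Approximate text elements: count characters that are not Mn/Me marks."""
    return sum(1 for ch in text if unicodedata.category(ch) not in ("Mn", "Me"))

print("number of UTF-8 bytes:", len(thai.encode("utf-8")))
print("number of code points:", len(thai))
print("approximate text elements:", count_base_clusters(thai))
```

This shows the same gap from the byte side: 9 bytes and 3 code points, but only one visible cell.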
