Why are these Thai characters displayed with a long tail on a web page?

ด ้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้ d ด ็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้

I found some funny characters, like the ones I pasted above, which take up only 3 character cells on screen. However, the actual length of the string is 380.

I checked the string in Python, and the encoded string looks like this:

'\ xe0 \ XB8 \ x94 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xd0 \ XB4 \ xe0 \ XB8 \ x94 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ X b9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 
\ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x87 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 \ xe0 \ xb9 \ x89 '

The string seems to be a combination of three Thai characters:

ด  \xe0\xb8\x94  THAI CHARACTER DO DEK
 ้  \xe0\xb9\x89  THAI CHARACTER MAI THO
 ็  \xe0\xb9\x87  THAI CHARACTER MAITAIKHU
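The breakdown above can be reproduced with Python's standard `unicodedata` module. This is only a sketch, assuming Python 3; the helper name `summarize` and the shortened `sample` bytes are made up for illustration and are not the full 380-byte string.

```python
import unicodedata
from collections import Counter

def summarize(raw: bytes) -> Counter:
    """Decode UTF-8 bytes and count how often each character name occurs."""
    text = raw.decode("utf-8")
    return Counter(unicodedata.name(ch, "UNKNOWN") for ch in text)

# Shortened, hypothetical sample in the same spirit as the string above:
# one DO DEK followed by five MAI THO and five MAITAIKHU.
sample = b"\xe0\xb8\x94" + b"\xe0\xb9\x89" * 5 + b"\xe0\xb9\x87" * 5

counts = summarize(sample)
for name, n in counts.items():
    print(f"{n:3d} x {name}")
print(len(sample), "bytes,", len(sample.decode("utf-8")), "code points")
```

Each of these Thai characters takes 3 bytes in UTF-8, which is why the byte length is so much larger than the number of visible cells.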

And my questions are:

  • Why do these characters behave differently? Is this a bug?
  • How can I avoid this on my site (possibly with some HTML filter)?

UPDATE

I tested the characters in a number of browsers, and the long tail appeared only in Chrome and Firefox on Windows.

Below are the screenshots:

[Screenshot: Windows 7, IE8]

[Screenshot: Ubuntu, Firefox]

[Screenshot: Windows 7, Chrome]

[Screenshot: Windows 7, Firefox]

Therefore, I think this is a browser-related bug.

+25
unicode
Aug 19 '11
4 answers

There are two problems: one in the output system (the font renderer), which is not Thai-aware, and one in the input system, which generated this text in the first place.

If you do your homework, you will find that MAI THO and MAITAIKHU (their Unicode names) are what Unicode calls nonspacing marks (NSM). This means that the font renderer should not advance to the next character cell when such a character is displayed.

To avoid the kind of mess shown in the question, the Thai API Consortium (TAPIC) created the WTT 2.0 standard, which describes how the font rendering algorithm should handle the order of the Thai letters it receives as input, and also which character sequences an input method should allow you to enter.
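To make the "no advance" idea concrete, here is a rough Python sketch that counts only the characters that would move the cursor. The helper name `estimate_cells` is made up, and skipping categories `Mn` and `Me` is a simplifying assumption: real display width is decided by the font and the renderer, not by this count.

```python
import unicodedata

def estimate_cells(text: str) -> int:
    """Count characters that advance the cursor, i.e. everything except
    nonspacing (Mn) and enclosing (Me) marks. A rough estimate only."""
    return sum(1 for ch in text if unicodedata.category(ch) not in ("Mn", "Me"))

DO_DEK, MAI_THO, MAITAIKHU = "\u0e14", "\u0e49", "\u0e47"
CYRILLIC_DE = "\u0434"  # the \xd0\xb4 bytes in the question's dump

# Shortened stand-in for the question's string: base, marks, base, base, marks.
marks = MAI_THO * 5 + MAITAIKHU * 5
text = DO_DEK + marks + CYRILLIC_DE + DO_DEK + marks

print(len(text))             # number of code points
print(estimate_cells(text))  # number of characters that take up a cell
```

With a Thai-aware renderer, only the base characters should occupy cells, which matches the "3 spaces" observed in the question.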

Standardization and implementation of the Thai language review

libthai includes both input and output methods.

thaicheck is a small program that can detect letter sequence problems and fix them.

By the way, no Thai word can contain the sequence DO DEK, MAI THO, MAITAIKHU; the input sequence is just noise.

Keep in mind that some editors have broken input methods that let you type multiple NSMs that cannot be combined, while the output method displays only the legal sequences; the result is an invalid input string that looks normal to the user on their own system.

+8
Aug 19 '11 at 10:19

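The codes you mention are all UTF-8, and each of these characters takes 3 bytes. The respective Unicode code points:

  • U+0E14 THAI CHARACTER DO DEK
  • U+0E49 THAI CHARACTER MAI THO
  • U+0E47 THAI CHARACTER MAITAIKHU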

The last two are in the category Mark, Nonspacing (Mn), which means that they are rendered combined with the preceding code point instead of in a cell of their own. MAI THO also has a Canonical_Combining_Class of 107; MAITAIKHU has class 0.

For instance, your example starts with a single base character and adds a lot of nonspacing marks to it.

Compare with this C# code:
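If you want to check these properties yourself, a small Python sketch with the standard `unicodedata` module is enough. The helper name `properties` is made up for illustration, and the values it prints depend on the Unicode database bundled with your Python version.

```python
import unicodedata

def properties(ch: str) -> tuple:
    """Return (code point, name, general category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # general category, e.g. Lo or Mn
        unicodedata.combining(ch),  # Canonical_Combining_Class, 0 if none
    )

for ch in ("\u0e14", "\u0e49", "\u0e47"):
    print(*properties(ch))
```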

    char DODEK = (char)0x0e14;
    char MAITHO = (char)0x0e49;
    char MAITAIKHU = (char)0x0e47;
    string thai = new string(new char[] { DODEK, MAITHO, MAITAIKHU });
    Console.WriteLine("number of code points: " + thai.Length);
    var si = new System.Globalization.StringInfo(thai);
    Console.WriteLine("number of text elements: " + si.LengthInTextElements);

Output:

    number of code points: 3
    number of text elements: 1

See also the .NET StringInfo class.

+3
Aug 19 '11 at 10:18

You should never combine hundreds of Unicode characters into a single grapheme, although Unicode technically allows it; normally you combine no more than 2 or 3 characters.

In Thai, vowels and tone marks can appear above the consonant (vowels sometimes also appear below or even around the consonant). This is a bit like accents over vowels in French (é, è ...) or umlauts in German. It is not normal to have more than two such marks on one consonant in Thai (or more than one in French or German). This means that your input is not valid Thai text (it was perhaps written to produce some funny graphical effect, like "ASCII art"). I am not surprised that such invalid text is rendered differently depending on the browser.
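Following that reasoning, one possible server-side filter is to cap how many combining marks may follow a single base character. The Python sketch below is an assumption-laden illustration, not a Thai validator: the name `limit_marks` and the default limit of 2 are made up, and some scripts legitimately stack more marks, so the limit should be chosen with care.

```python
import unicodedata

def limit_marks(text: str, max_marks: int = 2) -> str:
    """Drop nonspacing/enclosing marks beyond max_marks in any one run.

    Assumption: a "run" is a sequence of consecutive Mn/Me characters;
    any other character resets the counter.
    """
    out = []
    run = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)

noisy = "\u0e14" + "\u0e49" * 5 + "\u0e47" * 5
print(len(noisy), "->", len(limit_marks(noisy)))
```

Ordinary text passes through unchanged, since it rarely has more than one or two marks per base character.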

+2
Feb 28 '14 at 12:25

What you have found are called combining characters or, as normal people call it, Zalgo text.

This works because Unicode allows you to combine characters by adding diacritics after the character.

Any system that uses Unicode will work with these characters.

+1
May 19 '16 at 11:50


