What's going on with these Unicode combining characters, and how can we filter them?

ก ิิิิิิิิิิิิิิิิิิิิ ก ้้้้้้้้้้้้้้้้้้้้ ก ็็็็็็็็็็็็็็็็็็็็ ก ็็็็็็็็็็็็็็็็็็็็ ก ิิิิิิิิิิิิิิิิิิิิ ก ้้้้้้้้้้้้้้้้้้้้ ก ็็็็็็็็็็็็็็็็็็็็ ก ิิิิิิิิิิิิิิิิิิิิ ก ้้้้้้้้้้้้้้้้้้้้ ก ิิิิิิิิิิิิิิิิิิิิ ก ้้้้้้้้้้้้้้้้้้้้ ก ็็็็็็็็็็็็็็็็็็็็ ก ็็็็็็็็็็็็็็็็็็็็ ก ิิิิิิิิิิิิิิิิิิิิ ก ้้้้้้้้้้้้้้้้้้้้ ก ็็็็็็็ ็็็็็็็็็็็็็ ก ิิิิิิิิิิิิิิิิิิิิ ก ้้้้้้้้้้้้้้้้้้้้

They recently started appearing in comment sections on Facebook.

How can we sanitize this?

+88
unicode sanitize combining-marks zalgo
May 02 '12 at 13:34
4 answers

What happens to these Unicode characters?

That's a base character followed by a series of combining characters. Because the combining characters in question all want to sit above the base character, they stack up (literally). For instance, the case of

ก ้้้้้้้้้้้้้้้้้้้้

... is ก (Thai character ko kai, U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).

How can we sanitize this?
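To check a claim like this for any string, you can list its code points. The following is a minimal Python sketch rather than anything from the original answer; the helper name `describe` is an assumption, and the names and categories come from the standard `unicodedata` module.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Ko kai followed by 20 copies of mai tho, as in the example above.
sample = "\u0e01" + "\u0e49" * 20

# Show the base character and the first two combining marks.
for row in describe(sample)[:3]:
    print(row)
```

The category `Mn` (nonspacing mark) on every code point after the first is what marks them as combining characters.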

You could pre-process the text and limit the number of combining characters that can be applied to a single base character, but the effort may not be worth the reward. You would need the data sheets for all the current characters so that you know which ones are combining marks, and you would have to allow at least a few per base character, because some languages are written with several diacritics on a single base. If you are willing to restrict comments to the Latin character set, a simple range check is enough, but of course that is only an option if you want to limit comments to just a few languages. More information, code charts, and so on are available at unicode.org.
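The pre-processing idea above could be sketched roughly as follows. This is a minimal Python sketch, not the answerer's code: the function name `limit_marks` and the default cap of 3 are assumptions chosen for illustration, and the standard `unicodedata` module stands in for the character data sheets.

```python
import unicodedata

def limit_marks(text, max_marks=3):
    """Keep at most max_marks consecutive combining marks and drop the rest."""
    out = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # excess mark: skip it
        else:
            run = 0  # any non-mark character starts a new run
        out.append(ch)
    return "".join(out)

stacked = "\u0e01" + "\u0e49" * 20
print(len(stacked), len(limit_marks(stacked)))
```

With this cap, a Thai word such as ที่ (a base letter carrying a vowel sign and a tone mark) passes through unchanged, while the 20-mark stack is cut down to three. The right cap depends on which languages you need to support.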

By the way, if you ever want to know how a character was composed: for another question, I recently coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You just copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links, such as those above, to the page describing each character. It only works for code points in the range U+FFFF and under, because it is written in JavaScript, and handling characters above U+FFFF in JavaScript takes more work than I wanted to do for that question. (In JavaScript, a "character" is always 16 bits, which means that in some languages a single character can be split across two separate JavaScript "characters", and I did not account for that.) Still, it is handy for most texts.
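The 16-bit limitation mentioned above comes from UTF-16: code points above U+FFFF are stored as two 16-bit units (a surrogate pair). Here is a small Python sketch of that effect; the helper name `utf16_units` is an assumption made for illustration, and it expects a single code point as input.

```python
def utf16_units(ch):
    """Return the UTF-16 code units of one code point as uppercase hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]

print(utf16_units("\u0e01"))      # a code point at or below U+FFFF: one unit
print(utf16_units("\U0001F600"))  # a code point above U+FFFF: two units
```

A tool that walks a string one 16-bit unit at a time sees the second case as two separate "characters", which is the gap described in the answer.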

+79
May 02 '12 at 13:42

If you have a regex engine with decent Unicode support, it's trivial to sanitize strings like this. In Perl, for example, you can strip all but the first combining mark from each (user-perceived) character like this:

    #!/usr/bin/perl
    use strict;
    use utf8;
    binmode(STDOUT, ':utf8');

    my $string = "กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้ กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้";
    $string =~ s/(\p{Mark})\p{Mark}+/$1/g;  # Strip excess combining marks
    print("$string\n");

This will print:

กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้ กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้

+17
Mar 12 '13 at 18:33

As for "how can we sanitize it", the best answer is T.J. Crowder's.

However, I think sanitizing is the wrong approach, and Cristy has it right with overflow:hidden in the CSS of the containing element.

At least, that's how I'm solving it.

+11
Mar 12 '13 at 18:00

Well, this took me a while to figure out. I was under the impression that the combining characters used to produce zalgo were limited to the combining-character blocks listed on the wiki, so I expected the following regular expression to catch the freaks.

 ([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,}) 

and it didn’t work ...

The trick is that the list on the wiki does not cover the full range of combining characters.

What gave me a hint is that "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) returns "e49", which is not in any of those combining ranges; it falls in the Thai block (U+0E00 to U+0E7F), not in a dedicated combining block or in "Private Use".

In C#, they fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:

    // Needs System, System.Globalization, System.IO, System.Linq and NUnit.Framework (for [Test]).
    [Test]
    public void IsZalgo()
    {
        var zalgo = new[] { UnicodeCategory.NonSpacingMark };
        File.Delete("IsModifyLike.html");
        File.AppendAllText("IsModifyLike.html", "<table>");
        for (var i = 0; i < 65535; i++)
        {
            var c = (char)i;
            if (zalgo.Contains(Char.GetUnicodeCategory(c)))
            {
                // One row per mark: hex code, the mark, its category, and "A" with the mark applied three times.
                File.AppendAllText("IsModifyLike.html", string.Format(
                    "<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n",
                    i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
            }
        }
        File.AppendAllText("IsModifyLike.html", "</table>");
    }

By looking at the generated table, you should be able to see which of them stack. One range missing from the wiki is 06D6-06DC; another is 0730-0749.
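The same hunt for marks that the wiki list misses can be done without generating an HTML table. The sketch below is a Python approximation of the idea, not a port of the C# script: the names `WIKI_BLOCKS`, `in_wiki_blocks` and `missed_marks` are assumptions, and the result depends on the Unicode version bundled with your Python.

```python
import unicodedata

# The five dedicated combining blocks covered by the first regular expression.
WIKI_BLOCKS = [
    (0x0300, 0x036F), (0x1AB0, 0x1AFF), (0x1DC0, 0x1DFF),
    (0x20D0, 0x20FF), (0xFE20, 0xFE2F),
]

def in_wiki_blocks(cp):
    """True if the code point lies inside one of the dedicated combining blocks."""
    return any(lo <= cp <= hi for lo, hi in WIKI_BLOCKS)

def missed_marks():
    """BMP code points of category Mn (nonspacing mark) outside WIKI_BLOCKS."""
    return [
        cp for cp in range(0x10000)
        if unicodedata.category(chr(cp)) == "Mn" and not in_wiki_blocks(cp)
    ]

missed = missed_marks()
print(len(missed), [f"{cp:04X}" for cp in missed[:5]])
```

The Thai mai tho (0E49) and the Arabic and Syriac marks mentioned above all show up in this list, which is why a regular expression built only from the dedicated blocks lets them through.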

UPDATE:

Here is an updated regular expression that should catch all the zalgo, including the ones that slip through in the "normal" range.

 ([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,}) 

The hardest bit is identifying them; once you have done that, there are many solutions, including some of the ones above.

Hope this saves you some time.

+6
Mar 17 '16 at 12:38


