It took me a while to figure this out. I was under the impression that the range of combining characters used to produce zalgo is limited to the ranges listed on the wiki, so I expected the following regular expression to catch the freaks.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})
and it didn’t work ...
The catch is that the list on the wiki does not cover the full range of combining characters.
What gave me the hint is that "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) returns "e49", which is not in any of the combining ranges above. U+0E49 is THAI CHARACTER MAI THO, which sits in the Thai block (U+0E00-U+0E7F).
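To make that check repeatable, here is a minimal sketch that dumps the code units of a string and runs the first expression against it. The helper name codeUnitsHex and the sample string (one Thai letter plus three U+0E49 marks, written as escapes) are my own assumptions for illustration.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as hex.
function codeUnitsHex(text) {
  const out = [];
  for (let i = 0; i < text.length; i++) {
    out.push(text.charCodeAt(i).toString(16));
  }
  return out;
}

// Base letter U+0E01 followed by three U+0E49 marks.
const sample = "\u0E01\u0E49\u0E49\u0E49";

// The wiki-derived ranges from the first attempt (ASCII hyphens for ranges).
const wikiRanges = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,}/;

console.log(codeUnitsHex(sample).join(" ")); // e01 e49 e49 e49
console.log(wikiRanges.test(sample));        // false
```

The stacked marks are all U+0E49, so none of the wiki ranges can match them.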
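In C#, they fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:
// Requires: using System; using System.Globalization; using System.IO; using System.Linq; using NUnit.Framework;
[Test]
public void IsZalgo()
{
    var zalgo = new[] { UnicodeCategory.NonSpacingMark };

    // Start from a clean output file.
    File.Delete("IsModifyLike.html");
    File.AppendAllText("IsModifyLike.html", "<table>");

    // Walk every UTF-16 code unit and keep the nonspacing marks.
    for (var i = 0; i <= char.MaxValue; i++)
    {
        var c = (char)i;
        if (zalgo.Contains(Char.GetUnicodeCategory(c)))
        {
            // Columns: hex code, the character, its category, and "A" with the mark applied three times.
            File.AppendAllText("IsModifyLike.html",
                string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n",
                    i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
        }
    }

    File.AppendAllText("IsModifyLike.html", "</table>");
}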
By looking at the generated table, you can see which of them stack. One range missing from the wiki is 06D6-06DC; another is 0730-0749.
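If you are in JavaScript rather than C#, roughly the same category is available through Unicode property escapes: \p{Mn} (General_Category = Nonspacing_Mark) with the u flag. This is only a sketch and not what I used above. The function name looksLikeZalgo is made up, and the threshold of two marks is an assumption; decomposed text in some languages (Vietnamese, for example) can legitimately carry two marks on one letter, so you may want a higher threshold.

```javascript
// Two or more consecutive nonspacing marks (General_Category = Mn).
// Requires an engine with Unicode property escapes (ES2018+).
const stackedMarks = /\p{Mn}{2,}/u;

// Hypothetical helper: true when the text contains a run of stacked marks.
function looksLikeZalgo(text) {
  return stackedMarks.test(text);
}

console.log(looksLikeZalgo("\u0E01\u0E49\u0E49\u0E49")); // true  (Thai letter + three U+0E49)
console.log(looksLikeZalgo("Z\u0351\u036B\u0343a"));     // true  (three marks from U+0300-U+036F)
console.log(looksLikeZalgo("e\u0301"));                  // false (a single accent)
console.log(looksLikeZalgo("hello"));                    // false
```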
UPDATE:
Here is an updated regular expression that should catch all zalgo, including the ones that bypass the first expression by using marks outside the "normal" combining ranges.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})
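A minimal sketch of how the expression could be used in JavaScript, with ASCII hyphens for every range. The function names containsZalgo and stripZalgo are hypothetical, and dropping the whole run of marks is just one possible way to handle a match.

```javascript
// The updated character class, matching runs of two or more listed characters.
const zalgoRun = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,}/;

// Global copy for replace(); the non-global original is safe for repeated test() calls.
const zalgoRunGlobal = new RegExp(zalgoRun.source, "g");

// Hypothetical helper: does the text contain a run of stacked marks?
function containsZalgo(text) {
  return zalgoRun.test(text);
}

// Hypothetical helper: remove every matched run, leaving the base characters.
function stripZalgo(text) {
  return text.replace(zalgoRunGlobal, "");
}

console.log(containsZalgo("Z\u0351\u036B\u0343a")); // true
console.log(stripZalgo("Z\u0351\u036B\u0343a"));    // Za
console.log(containsZalgo("hello"));                // false
```

Keep in mind that \u0e00-\u0eff spans the whole Thai and Lao blocks, base letters included, so any two consecutive Thai or Lao characters match as well. If you need to accept real Thai text, narrow that range to the marks that actually stack.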
The hardest bit is identifying them. Once you have done that, there are many ways to deal with them, including some of the solutions above.
Hope this saves you some time.