If you want to create your own library, you need to use the table of permitted code points ( IANA - Repository of IDN Practices, IDN Character Validation Guidance , IDNA Parameters ) and the table of Unicode Script properties ( UNIDATA / Scripts.txt ).
Gmail adopts the Unicode Consortium's "Highly Restricted" specification ( Protecting Gmail in a global world ). The following combinations of Unicode Scripts are permitted.
- Single script
- Latin + Han + Hiragana + Katakana
- Latin + Han + Bopomofo
- Latin + Han + Hangul
You may need to pay attention to the special values of the Script property (Common, Inherited, Unknown), since some characters have multiple properties or wrong properties.
For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two properties (Katakana and Hiragana), and the PCRE functions classify it as Inherited. Another example is U+2A708. Although the correct Script property of U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification misclassifies it as "Han".
You may also need to consider IDN homograph attacks . Google Chrome's IDN policy uses a blacklist of characters .
My recommendation is to use Zend\Validator\Hostname. This library uses the table of permitted code points for Japanese and Chinese.
If you use Symfony, consider upgrading the application to version 2.5, which adopts egulias/email-validator ( Guide ). You need an extra check to see whether the string is a well-formed byte sequence. For more information, see my report.
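A minimal sketch of such a well-formedness check, assuming the address is expected to be UTF-8. The function name `is_well_formed_utf8` is hypothetical; it relies on the fact that `preg_match()` with the `/u` modifier fails on invalid UTF-8 subjects.

```php
<?php
// Hypothetical helper: true if $bytes is a well-formed UTF-8 sequence.
function is_well_formed_utf8(string $bytes): bool
{
    // With /u, preg_match() returns false (not 0) when the subject is not
    // valid UTF-8; the empty pattern matches any valid string, even ''.
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_well_formed_utf8('user@例え.jp'));
var_dump(is_well_formed_utf8("user@\xE3\x81.jp")); // truncated 3-byte sequence
```

If the mbstring extension is available, `mb_check_encoding($bytes, 'UTF-8')` is an alternative. Either way, run the check before handing the string to the validator.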
Don't forget about XSS and SQL injection. The following address is a valid email address according to RFC 5322.
    // From Japanese tutorial
    // http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
    "><script>alert('or/**/1=1#')</script>"@example.jp
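Because an address like this can pass validation, it still has to be escaped wherever it is used. The following is a minimal sketch for the HTML side only; `escape_for_html` is a hypothetical wrapper around `htmlspecialchars()`. For SQL, the usual approach is a prepared statement with bound parameters (for example via `PDO::prepare`) instead of string concatenation.

```php
<?php
// The RFC 5322-valid address from the example above.
$address = '"><script>alert(\'or/**/1=1#\')</script>"@example.jp';

// Hypothetical wrapper: escape a value before echoing it into HTML.
function escape_for_html(string $value): string
{
    // ENT_QUOTES escapes both single and double quotes;
    // ENT_SUBSTITUTE replaces invalid byte sequences instead of returning ''.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo escape_for_html($address), PHP_EOL;
```

The escaped output contains no raw `<`, `>` or quote characters, so the browser renders the address as text and does not run the script.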
I think using idn_to_ascii for validation is questionable, since idn_to_ascii lets almost every character through.
    // Print every code point that idn_to_ascii() accepts.
    for ($i = 0; $i < 0x110000; ++$i) {
        $c = utf8_chr($i);
        if ($c !== '' && false !== idn_to_ascii($c)) {
            // Zero-pad to at least four hex digits, e.g. U+0041.
            $number = str_pad(strtoupper(dechex($i)), 4, '0', STR_PAD_LEFT);
            $idn = $c.'example.com';
            echo 'U+'.$number.' ';
            echo ' '.$idn.' '.idn_to_ascii($idn);
            echo PHP_EOL;
        }
    }

    // Encode a code point as UTF-8. Returns '' for out-of-range values
    // and for surrogates (U+D800..U+DFFF).
    function utf8_chr($code_point)
    {
        if ($code_point < 0 || 0x10FFFF < $code_point
            || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
            return '';
        }
        if ($code_point < 0x80) {
            // 1 byte: 0xxxxxxx
            return chr($code_point);
        }
        if ($code_point < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx (lead byte is 0xC0, not 0x1C0)
            return chr(0xC0 | ($code_point >> 6))
                .chr(0x80 | ($code_point & 0x3F));
        }
        if ($code_point < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return chr(0xE0 | ($code_point >> 12))
                .chr(0x80 | (($code_point >> 6) & 0x3F))
                .chr(0x80 | ($code_point & 0x3F));
        }
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return chr(0xF0 | ($code_point >> 18))
            .chr(0x80 | (($code_point >> 12) & 0x3F))
            .chr(0x80 | (($code_point >> 6) & 0x3F))
            .chr(0x80 | ($code_point & 0x3F));
    }
If you want to check the Unicode Script properties of a domain, use the PCRE functions.
The following code shows how to get the name of a character's Unicode Script property. If you want to use Unicode Script properties in JavaScript, use mathiasbynens/unicode-data .
    // Return the Unicode Script name of the character $c, or '' if none matches.
    function get_unicode_script_name($c)
    {
        // http://php.net/manual/regexp.reference.unicode.php
        $names = [
            'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak',
            'Bengali', 'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid',
            'Canadian_Aboriginal', 'Carian', 'Chakma', 'Cham', 'Cherokee',
            'Common', 'Coptic', 'Cuneiform', 'Cypriot', 'Cyrillic', 'Deseret',
            'Devanagari', 'Egyptian_Hieroglyphs', 'Ethiopic', 'Georgian',
            'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 'Gurmukhi', 'Han',
            'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
            'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian',
            'Javanese', 'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li',
            'Kharoshthi', 'Khmer', 'Lao', 'Latin', 'Lepcha', 'Limbu',
            'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
            'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao',
            'Mongolian', 'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic',
            'Old_Persian', 'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki',
            'Oriya', 'Osmanya', 'Phags_Pa', 'Phoenician', 'Rejang', 'Runic',
            'Samaritan', 'Saurashtra', 'Sharada', 'Shavian', 'Sinhala',
            'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
            'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil',
            'Telugu', 'Thaana', 'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic',
            'Vai', 'Yi'
        ];
        // Try each script name in turn and return the first one that matches.
        foreach ($names as $name) {
            $pattern = '/\p{'.$name.'}/u';
            if (preg_match($pattern, $c)) {
                return $name;
            }
        }
        return '';
    }