How to check internationalized domain names

Question

I want to validate a domain URL in PHP that may be an internationalized domain name, for example a Greek domain name such as http://παράδειγμα.δοκιμή. Is there any way to check it with a regular expression?

+7

php regex dns

user1969981 Jan 14 '13 at 5:52

3 answers

masakielastic · Answer 1 · 2014-10-24T08:32:52+0000

If you want to create your own library, you need to use the tables of permitted code points (IANA - Repository of IDN Practices, IDN Character Validation Guidance, IDNA Parameters) and the Unicode Script property table (UNIDATA/Scripts.txt).

Gmail adopts the Unicode Consortium's "Highly Restrictive" specification (Protecting Gmail in a Global World). The following combinations of Unicode scripts are allowed.

Single script
Latin + Han + Hiragana + Katakana
Latin + Han + Bopomofo
Latin + Han + Hangul
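
The combinations listed above can be sketched with PCRE script properties. This is only a minimal illustration, not Gmail's actual implementation: the function names `label_scripts` and `is_allowed_script_mix` are hypothetical, the candidate script list is an assumption kept short for readability, and any character whose script is not in that list is rejected outright.

```php
<?php
// Hypothetical helper: returns the scripts used in a label, ignoring
// Common and Inherited characters (digits, hyphen, combining marks).
// Returns null for invalid UTF-8 or a script outside the candidate list.
function label_scripts(string $label): ?array
{
    // Assumption: a short candidate list; extend it as needed.
    $candidates = ['Latin', 'Han', 'Hiragana', 'Katakana', 'Bopomofo',
                   'Hangul', 'Greek', 'Cyrillic'];
    $chars = preg_split('//u', $label, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return null; // not well-formed UTF-8
    }
    $found = [];
    foreach ($chars as $char) {
        if (preg_match('/^[\p{Common}\p{Inherited}]$/u', $char)) {
            continue;
        }
        $script = null;
        foreach ($candidates as $name) {
            if (preg_match('/^\p{' . $name . '}$/u', $char)) {
                $script = $name;
                break;
            }
        }
        if ($script === null) {
            return null; // script not in the candidate list
        }
        $found[$script] = true;
    }
    return array_keys($found);
}

// Hypothetical helper: true for a single script or one of the listed mixes.
function is_allowed_script_mix(string $label): bool
{
    $scripts = label_scripts($label);
    if ($scripts === null) {
        return false;
    }
    if (count($scripts) <= 1) {
        return true; // single script
    }
    $allowed = [
        ['Latin', 'Han', 'Hiragana', 'Katakana'],
        ['Latin', 'Han', 'Bopomofo'],
        ['Latin', 'Han', 'Hangul'],
    ];
    foreach ($allowed as $set) {
        if (array_diff($scripts, $set) === []) {
            return true;
        }
    }
    return false;
}
```

For instance, a label that mixes Latin letters with a look-alike Cyrillic letter uses two scripts that appear together in none of the allowed sets, so `is_allowed_script_mix` would reject it.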

You may need to pay attention to the special values of the Script property (Common, Inherited, Unknown), since some characters have several Script properties or incorrect ones.

For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two Script properties (Katakana and Hiragana), and PCRE classifies it as Inherited. Another example is U+2A708. Although the correct Script property for U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification incorrectly classifies it as "Han".

You may also need to consider IDN homograph attacks. Google Chrome's IDN policy adopts a blacklist of characters.

My recommendation is to use Zend\Validator\Hostname. This library uses tables of permitted code points for Japanese and Chinese.

If you use Symfony, try upgrading the application to version 2.5, which uses egulias/email-validator (Guide). You need an additional check to see whether the string is a well-formed byte sequence. For more information, see my report.
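
One possible way to run the well-formedness check mentioned above is to rely on the `/u` modifier, which makes `preg_match()` fail on invalid UTF-8. This is only a sketch, and the helper name `is_well_formed_utf8` is hypothetical.

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
function is_well_formed_utf8(string $s): bool
{
    // With the /u modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8; the empty pattern matches anything else.
    return preg_match('//u', $s) === 1;
}
```

Running this check before any script or code point validation avoids feeding truncated or overlong sequences into later steps.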

Do not forget about XSS and SQL injection. The following is a valid email address according to RFC 5322.

    // From a Japanese tutorial
    // http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
    "><script>alert('or/**/1=1#')</script>"@example.jp
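
Because an address like this can pass validation, it still has to be escaped at output time. A minimal sketch for the HTML case, assuming the value is echoed into a UTF-8 page (the variable names are illustrative):

```php
<?php
// The RFC 5322-valid address from the example above.
$address = '"><script>alert(\'or/**/1=1#\')</script>"@example.jp';

// Escape for an HTML context; ENT_QUOTES also converts single quotes.
$safe = htmlspecialchars($address, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo $safe, PHP_EOL;
```

For the SQL side, the usual counterpart is to pass the value through a parameterized query (for example a PDO prepared statement) rather than concatenating it into the statement.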

I think it is questionable to use idn_to_ascii for validation, since idn_to_ascii lets almost every character through.

    // Print every code point that idn_to_ascii() accepts on its own.
    for ($i = 0; $i < 0x110000; ++$i) {
        $c = utf8_chr($i);
        if ($c !== '' && false !== idn_to_ascii($c)) {
            // Format as U+XXXX, zero-padded to at least four hex digits.
            $number = strtoupper(dechex($i));
            $length = strlen($number);
            if ($i < 0x10000) {
                $number = str_repeat('0', 4 - $length).$number;
            }
            $idn = $c.'example.com';
            echo 'U+'.$number.' ';
            echo ' '.$idn.' '. idn_to_ascii($idn);
            echo PHP_EOL;
        }
    }

    // Encode a code point as UTF-8; returns '' for out-of-range values
    // and surrogates.
    function utf8_chr($code_point) {
        if ($code_point < 0 || 0x10FFFF < $code_point
            || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
            return '';
        }
        if ($code_point < 0x80) {
            $hex[0] = $code_point;
            $ret = chr($hex[0]);
        } else if ($code_point < 0x800) {
            // Two-byte lead byte is 0xC0 (the original 0x1C0 only worked
            // because chr() wraps modulo 256).
            $hex[0] = 0xC0 | $code_point >> 6;
            $hex[1] = 0x80 | $code_point & 0x3F;
            $ret = chr($hex[0]).chr($hex[1]);
        } else if ($code_point < 0x10000) {
            $hex[0] = 0xE0 | $code_point >> 12;
            $hex[1] = 0x80 | $code_point >> 6 & 0x3F;
            $hex[2] = 0x80 | $code_point & 0x3F;
            $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
        } else {
            $hex[0] = 0xF0 | $code_point >> 18;
            $hex[1] = 0x80 | $code_point >> 12 & 0x3F;
            $hex[2] = 0x80 | $code_point >> 6 & 0x3F;
            $hex[3] = 0x80 | $code_point & 0x3F;
            $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
        }
        return $ret;
    }

If you want to check the domain for Unicode Script properties, use the PCRE functions.

The following code shows how to get the name of a character's Unicode Script property. If you want to use Unicode Script properties in JavaScript, use mathiasbynens/unicode-data.

    // Returns the first script name whose \p{...} class matches $c,
    // or '' if none matches.
    function get_unicode_script_name($c) {
        // http://php.net/manual/regexp.reference.unicode.php
        $names = [
            'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak',
            'Bengali', 'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid',
            'Canadian_Aboriginal', 'Carian', 'Chakma', 'Cham', 'Cherokee',
            'Common', 'Coptic', 'Cuneiform', 'Cypriot', 'Cyrillic', 'Deseret',
            'Devanagari', 'Egyptian_Hieroglyphs', 'Ethiopic', 'Georgian',
            'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 'Gurmukhi', 'Han',
            'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
            'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian',
            'Javanese', 'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li',
            'Kharoshthi', 'Khmer', 'Lao', 'Latin', 'Lepcha', 'Limbu',
            'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
            'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao',
            'Mongolian', 'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic',
            'Old_Persian', 'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki',
            'Oriya', 'Osmanya', 'Phags_Pa', 'Phoenician', 'Rejang', 'Runic',
            'Samaritan', 'Saurashtra', 'Sharada', 'Shavian', 'Sinhala',
            'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
            'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil',
            'Telugu', 'Thaana', 'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic',
            'Vai', 'Yi'
        ];
        foreach ($names as $name) {
            $pattern = '/\p{'.$name.'}/u';
            if (preg_match($pattern, $c)) {
                return $name;
            }
        }
        return '';
    }

Greenover · Answer 2 · 2013-01-14T06:04:13+0000

These are IDN domains. I would first convert them to Punycode and then check the domain.

But if you really want to check with a regular expression:

    <?php
    $domain = 'παράδειγμα.gr';
    // Rejects a (non-exhaustive) list of invalid characters in the name part.
    $regex = '#^([\w-]+://?|www[\.])?([^\-\s\,\;\:\+\/\\\?\^\`\=\&\%\"\'\*\#\<\>]*)\.[a-z]{2,7}$#';
    if (preg_match($regex, $domain)) {
        echo "VALID";
    }

But this can produce false positives, because it is very difficult to validate an IDN domain this way. I tried to check that invalid characters are not included, but the list is NOT complete.

It is better to convert to Punycode first:

    // After Punycode conversion only ASCII letters, digits, hyphens and dots remain.
    $regex = '#^([\w-]+://?|www[\.])?[a-z0-9]+[a-z0-9\-\.]*[a-z0-9]+\.[a-z]{2,7}$#';
    if (preg_match($regex, idn_to_ascii($domain))) {
        echo "VALID";
    }

And if you want to check if the domain can be resolved, try:

    $regex = '#^([\w-]+://?|www[\.])?[a-z0-9]+[a-z0-9\-\.]*[a-z0-9]+\.[a-z]{2,7}$#';
    $punny_domain = idn_to_ascii($domain);
    if (preg_match($regex, $punny_domain)) {
        // gethostbyname() returns its argument unchanged when the lookup fails.
        if (gethostbyname($punny_domain) != $punny_domain) {
            echo "VALID";
        }
    }

Michel feldheim · Answer 3 · 2013-01-15T08:30:48+0000

This is a so-called IDN domain. Clients that support IDNs normalize it using the IDNA2008 standard as specified in RFC 5890, then replace the remaining Unicode characters using Punycode, as defined in RFC 3492, before sending it for DNS resolution.

By specification, practically every character in the Unicode character set is valid for use in an IDN domain, but each top-level domain authority can define its own set of valid Unicode characters, so it will be difficult to create and maintain a real regular expression.

If you want to accept IDNs in your application, you should work internally with the encoded version. The PHP intl extension provides two functions for encoding and decoding IDN domain names (idn_to_ascii and idn_to_utf8):

 echo idn_to_ascii('täst.de');

xn--tst-qla.de
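
A slightly more defensive sketch of the same call, assuming the intl extension may be missing and that conversion errors should be treated as a rejection. The wrapper name `encode_idn` is hypothetical; the error information comes from the optional fourth argument of `idn_to_ascii`.

```php
<?php
// Hypothetical wrapper: returns the ASCII (Punycode) form of $domain,
// or null if intl is unavailable or the conversion reports an error.
function encode_idn(string $domain): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null; // intl extension not loaded
    }
    $info = [];
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46, $info);
    // $info['errors'] is a bitset; any non-zero value signals a problem.
    if ($ascii === false || !empty($info['errors'])) {
        return null;
    }
    return $ascii;
}

$encoded = encode_idn('täst.de');
echo $encoded === null ? 'conversion failed' : $encoded, PHP_EOL;
```

Returning null instead of false keeps the "not usable" case distinct from any valid string, so callers can store only the encoded form internally.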

Once the domain is encoded, any traditional regular expression will work.

Simple check:

    $url = "http://example.com/";
    if (preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)) {
        echo 'OK';
    } else {
        echo 'Invalid URL.';
    }
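
If only the host name needs checking, a somewhat stricter sketch is to test the encoded name label by label, applying the usual DNS limits of 63 characters per label and 253 for the whole name. The helper name `is_valid_ascii_hostname` is hypothetical, and the function assumes its input has already been converted with idn_to_ascii.

```php
<?php
// Hypothetical helper: checks an already ASCII-encoded (Punycode) host name.
function is_valid_ascii_hostname(string $host): bool
{
    // Whole name: non-empty and at most 253 characters.
    if ($host === '' || strlen($host) > 253) {
        return false;
    }
    foreach (explode('.', $host) as $label) {
        // Each label: 1 to 63 letters, digits, or hyphens,
        // with no leading or trailing hyphen.
        if (!preg_match('/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/iD', $label)) {
            return false;
        }
    }
    return true;
}
```

This only checks the syntax of the name; it says nothing about whether a registry would accept the characters or whether the domain resolves.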

EDIT:

If you want real DNS verification, you can use dns_get_record (PHP 5) or gethostbyname.

For example:

    $domain = 'ελληνικά.idn.icann.org';
    $idnDomain = idn_to_ascii( $domain );
    if ( $dnsResult = dns_get_record( $idnDomain, DNS_ANY ) ) {
        echo $idnDomain , "\n";
        print_r( $dnsResult );
    } else {
        echo "failed to lookup domain\n";
    }

Result:

    xn--hxargifdar.idn.icann.org
    Array
    (
        [0] => Array
            (
                [host] => xn--hxargifdar.idn.icann.org
                [class] => IN
                [ttl] => 21456
                [type] => A
                [ip] => 199.7.85.10
            )
        [1] => Array
            (
                [host] => xn--hxargifdar.idn.icann.org
                [class] => IN
                [ttl] => 21600
                [type] => AAAA
                [ipv6] => 2620::2830:230:0:0:0:10
            )
    )
