Regex to match Domain.CCTLD

Does anyone know a regex to match Domain.CCTLD? I do not want subdomains, only the "atomic domain". For example, docs.google.com should not match, but google.com should. However, this is complicated by multi-part ccTLDs such as .co.uk. Does anyone know a solution? Thanks in advance.

EDIT: I realized that I also have to deal with multiple subdomain levels, for example john.doe.google.co.uk. Now I need a solution more than ever :P

+8
python regex subdomain tld dns
Jul 07 2018-10-22T00:
4 answers

Based on your comment above, I'm going to reframe the question: instead of writing a regular expression that matches these names, we will write a function that matches them, and use that function to filter a list of domain names down to registrable domains only, for example google.com, amazon.co.uk.

First we need a TLD list. As Greg noted, the Public Suffix List is a great place to start. Suppose you have loaded the suffix list into a Python list called suffixes. If you are not sure how to do this, leave a comment and I can add code that does it.

 suffixes = parse_suffix_list("suffix_list.txt") 
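The parse_suffix_list helper is not defined in the answer. A minimal sketch of what it might look like, assuming the file follows the Public Suffix List text format (one rule per line, `//` comments, blank lines allowed). The helper name parse_suffix_lines is hypothetical, and wildcard (`*.`) and exception (`!`) rules are skipped here as a simplification:

```python
def parse_suffix_lines(lines):
    """Collect plain suffix rules from Public Suffix List style lines."""
    suffixes = set()
    for line in lines:
        parts = line.split()
        # Skip blank lines
        if not parts:
            continue
        # Only the text up to the first whitespace is the rule
        rule = parts[0]
        # Skip // comments
        if rule.startswith("//"):
            continue
        # Skip wildcard and exception rules (a simplification in this sketch)
        if rule.startswith("*.") or rule.startswith("!"):
            continue
        suffixes.add(rule.lower())
    return suffixes


def parse_suffix_list(path):
    """Read a suffix list file and return its plain rules as a set."""
    with open(path, encoding="utf-8") as handle:
        return parse_suffix_lines(handle)
```

This returns a set rather than a list; it can still be iterated over, and membership tests on it are fast.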

Now we need code that checks whether a given domain name matches the some-name.suffix pattern:

 def is_domain(d):
     for suffix in suffixes:
         # Require the dot so that "notcom" does not match the suffix "com"
         if d.endswith("." + suffix):
             # Get the base domain name without the suffix and its leading dot
             base_name = d[:-(len(suffix) + 1)]
             # If it contains '.', it is a subdomain
             if base_name and "." not in base_name:
                 return True
     # If we get here, no matches were found
     return False
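As a usage sketch, the filtering step described in this answer might look like the following. The tiny SUFFIXES set and the name is_atomic_domain are illustrative assumptions, not part of the answer. This variant looks up everything after the first dot in a set instead of looping over every suffix, and it rejects names that are themselves a suffix:

```python
# Illustrative sample only; a real run would load the full Public Suffix List
SUFFIXES = {"com", "uk", "co.uk"}


def is_atomic_domain(name, suffixes=SUFFIXES):
    """Return True if name is exactly one label followed by a known suffix."""
    name = name.lower().rstrip(".")
    # A bare suffix such as "co.uk" is not a registrable domain
    if name in suffixes:
        return False
    label, _, rest = name.partition(".")
    return bool(label) and rest in suffixes


candidates = [
    "google.com",
    "docs.google.com",
    "amazon.co.uk",
    "co.uk",
    "john.doe.google.co.uk",
]
atomic = [c for c in candidates if is_atomic_domain(c)]
print(atomic)  # ['google.com', 'amazon.co.uk']
```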
+3
Jul 08 '10 at 21:41

It looks like you are looking for information available through the Public Suffix List project.

A "public suffix" is one under which Internet users can register names directly. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

There is no single regular expression that would reasonably match the list of public suffixes. You will need to write code that uses the suffix list to split hostnames, or find an existing library that already does this.
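One way such code might use the list is sketched below: drop labels from the left of the hostname until the remainder is a known suffix, then keep one extra label. The function name registrable_domain and the sample suffix set are assumptions for illustration, and wildcard and exception rules from the real list are not handled:

```python
def registrable_domain(hostname, suffixes):
    """Return the label just left of the longest matching suffix, plus that suffix.

    Returns None when no suffix matches or the hostname is itself a suffix.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # A bare public suffix has no registrable part
    if ".".join(labels) in suffixes:
        return None
    # Dropping labels from the left tries the longest candidate suffix first
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[i - 1:])
    return None


sample_suffixes = {"com", "uk", "co.uk"}  # illustrative subset only
print(registrable_domain("john.doe.google.co.uk", sample_suffixes))  # google.co.uk
```

Under these assumptions, a hostname is an "atomic domain" in the question's sense when registrable_domain returns the hostname itself.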

+8
Jul 07 2018-10-10T00:

I would probably solve this by getting a complete list of TLDs and using it to build a regular expression. For example (in Ruby, sorry, I'm not a Pythonista yet):

 tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
 regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i

I don't think you can correctly distinguish between a real two-part TLD and a subdomain without knowing the actual TLD list (i.e. anyone who knew how the regular expression worked could always construct a subdomain that looks like a TLD).
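Since the question is tagged python, here is a rough Python sketch of the same idea. The short tlds list is an illustrative subset, and the names tlds, tld_alternation and domain_re are assumptions for this example:

```python
import re

# Illustrative subset; a real list would come from a full TLD or public suffix source
tlds = ["com", "co.uk", "eu", "org"]

# Escape each suffix so its dots match literally, then join into an alternation
tld_alternation = "|".join(re.escape("." + tld) for tld in tlds)

# Exactly one label (letters, digits, inner hyphens) followed by a listed suffix
domain_re = re.compile(
    r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?(" + tld_alternation + r")$",
    re.IGNORECASE,
)

for name in ["google.com", "amazon.co.uk", "docs.google.com"]:
    print(name, bool(domain_re.match(name)))
```

Because the leading label cannot contain a dot, docs.google.com is rejected while google.com and amazon.co.uk are accepted.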

+2
Jul 07 2018-10-22T00:
 ^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$ 
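A quick check of this pattern against the question's own examples, as a sketch: because the character class includes `.`, it accepts subdomains, and the fixed alternation has no entry for .co.uk.

```python
import re

pattern = re.compile(r"^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$")

for name in ["google.com", "docs.google.com", "google.co.uk"]:
    print(name, bool(pattern.match(name)))
# google.com True
# docs.google.com True
# google.co.uk False
```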
-3
Jul 07 2018-10-10T00:


