Extract a 2nd level domain from a domain? - Python

I have a list of domains, for example

  • site.co.uk

  • site.com

  • site.me.uk

  • site.jpn.com

  • site.org.uk

  • site.it

The domain names may also contain third- and fourth-level domains, for example:

  • test.example.site.org.uk

  • test2.site.com

I need to try to extract the 2nd level domain name; in all these cases that would be site.


Any ideas? :)

+7
6 answers

There is no way to reliably get this. Subdomains are arbitrary, and the monster list of domain extensions grows every day. Your best bet is to check against that monster list of domain extensions and keep a local copy of it.

The list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

+8

Following @kohlehydrat's suggestion:

import urllib2

class TldMatcher(object):
    # use class vars for lazy loading
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL

        # grab master list
        lines = urllib2.urlopen(url).readlines()

        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if len(ln) and ln[:2] != '//']

        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        best_match = None
        chunks = url.split('.')

        for start in range(len(chunks) - 1, -1, -1):
            test = '.'.join(chunks[start:])
            startest = '.'.join(['*'] + chunks[start + 1:])

            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test

        return best_match

    def get2ld(self, url):
        urls = url.split('.')
        tlds = self.getTld(url).split('.')
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print "Error: found '{0}', should be 'site'".format(res)
            errors += 1

    if errors == 0:
        print "Passed!"
    return (errors == 0)
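
A quick usage example (not in the original answer) with the deeper names from the question, assuming the Mozilla master list downloads successfully:

# usage sketch with the third/fourth-level examples from the question
matcher = TldMatcher()
print matcher.get2ld('test.example.site.org.uk')   # 'site'
print matcher.get2ld('test2.site.com')             # 'site'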
+5

The problem is that you are mixing up extraction of the first and second levels.

A trivial solution:

Make a list of possible suffixes, ordered from the most specific to the most general: "co.uk", "uk", "co.jp", "jp", "com".

Then check whether one of these suffixes matches the end of the domain; if one does, the label just before it is the site.
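
A minimal sketch of this trivial approach, with an illustrative hand-maintained SUFFIXES list (the names here are assumptions, not from the answer):

# hedged sketch: SUFFIXES is a hand-maintained list, most specific entries first
SUFFIXES = ['co.uk', 'org.uk', 'me.uk', 'jpn.com', 'uk', 'com', 'it']

def second_level(domain):
    for suffix in SUFFIXES:
        if domain.endswith('.' + suffix):
            # the label just before the matching suffix is the site name
            return domain[:-len(suffix) - 1].split('.')[-1]
    return None

print second_level('test.example.site.org.uk')   # 'site'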

+3

The only possible way is with a list of all top-level domains (here meaning, for example, .com or .co.uk). Then you check the domain against that list. I see no other way, at least not without Internet access at runtime.

+2

Using the python tld package:

https://pypi.python.org/pypi/tld

$ pip install tld

from tld import get_tld
print get_tld("http://www.google.co.uk")
# 'google.co.uk'
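
A hedged follow-up (not part of the original answer): assuming get_tld strips subdomains the same way for the deeper examples from the question, the bare second-level label can be split off the result:

from tld import get_tld

# assumption: get_tld also reduces deeper names to the registered domain
registered = get_tld("http://test.example.site.org.uk")   # expected: 'site.org.uk'
print registered.split('.')[0]                            # 'site'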
+2

@Hugh Bothwell

In your example you are not handling exception domains such as parliament.uk; those are listed in the file with a "!" prefix (e.g. !parliament.uk).

I made some changes to your code and also made it look more like my PHP function that I used before.

Also added the ability to load data from a local file.
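
A minimal sketch (not the actual TLDExtractor code; see the repository linked at the end) of how the "!" exception and "*" wildcard rules could be handled, with the rules loaded from a local copy of the file:

def load_rules(path):
    # read a local copy of effective_tld_names.dat, skipping comments and blanks
    with open(path) as f:
        return set(ln.strip() for ln in f if ln.strip() and not ln.startswith('//'))

def get_suffix(domain, rules):
    chunks = domain.split('.')
    best = None
    for start in range(len(chunks) - 1, -1, -1):
        candidate = '.'.join(chunks[start:])
        wildcard = '.'.join(['*'] + chunks[start + 1:])
        if '!' + candidate in rules:
            break          # exception rule: candidate is NOT a public suffix
        if candidate in rules or wildcard in rules:
            best = candidate
    return best

# e.g. get_suffix('parliament.uk', rules) -> 'uk', so 'parliament' is the site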

I also tested it with some domains, such as:

  • niki.bg, niki.1.bg
  • parliament.uk
  • niki.at, niki.co.at
  • niki.us, niki.ny.us
  • niki.museum, niki.national.museum
  • www.niki.uk - due to the "*" wildcard in the Mozilla file, this one is reported as OK

Feel free to contact me @github, so I can add you as a collaborator.

The GitHub repository is here:

https://github.com/nmmmnu/TLDExtractor/blob/master/TLDExtractor.py

+1
