Extract a 2nd level domain from a domain? - Python

I have a list of domains, for example

  • site.co.uk

  • site.com

  • site.me.uk

  • site.jpn.com

  • site.org.uk

  • site.it

The domain names may also contain third- and fourth-level domains, for example:

  • test.example.site.org.uk

  • test2.site.com

I need to try to extract the 2nd level domain name; in all these cases that would be site.


Any ideas? :)

+7
6 answers

There is no way to reliably get this. Subdomains are arbitrary, and the monster list of domain extensions grows every day. Your best bet is to check against that monster list of domain extensions and keep a local copy of it.

The list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

+8

Following @kohlehydrat's suggestion:

import urllib2

class TldMatcher(object):
    # use class vars for lazy loading
    MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
    TLDS = None

    @classmethod
    def loadTlds(cls, url=None):
        url = url or cls.MASTERURL

        # grab master list
        lines = urllib2.urlopen(url).readlines()

        # strip comments and blank lines
        lines = [ln for ln in (ln.strip() for ln in lines) if len(ln) and ln[:2] != '//']

        cls.TLDS = set(lines)

    def __init__(self):
        if TldMatcher.TLDS is None:
            TldMatcher.loadTlds()

    def getTld(self, url):
        best_match = None
        chunks = url.split('.')

        for start in range(len(chunks) - 1, -1, -1):
            test = '.'.join(chunks[start:])
            startest = '.'.join(['*'] + chunks[start + 1:])

            if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
                best_match = test

        return best_match

    def get2ld(self, url):
        urls = url.split('.')
        tlds = self.getTld(url).split('.')
        return urls[-1 - len(tlds)]


def test_TldMatcher():
    matcher = TldMatcher()

    test_urls = [
        'site.co.uk',
        'site.com',
        'site.me.uk',
        'site.jpn.com',
        'site.org.uk',
        'site.it'
    ]

    errors = 0
    for u in test_urls:
        res = matcher.get2ld(u)
        if res != 'site':
            print "Error: found '{0}', should be 'site'".format(res)
            errors += 1

    if errors == 0:
        print "Passed!"
    return (errors == 0)
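
A quick usage example (not in the original answer) with the deeper names from the question, assuming the Mozilla master list downloads successfully:

# usage sketch with the third/fourth-level examples from the question
matcher = TldMatcher()
print matcher.get2ld('test.example.site.org.uk')   # 'site'
print matcher.get2ld('test2.site.com')             # 'site'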
+5

The problem is that you are mixing up extraction of the first and second levels.

A trivial solution:

Make a list of possible suffixes, ordered from the most specific to the most general: "co.uk", "uk", "co.jp", "jp", "com".

Then check whether one of these suffixes matches the end of the domain; if one does, the label just before it is the site.
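
A minimal sketch of this trivial approach, with an illustrative hand-maintained SUFFIXES list (the names here are assumptions, not from the answer):

# hedged sketch: SUFFIXES is a hand-maintained list, most specific entries first
SUFFIXES = ['co.uk', 'org.uk', 'me.uk', 'jpn.com', 'uk', 'com', 'it']

def second_level(domain):
    for suffix in SUFFIXES:
        if domain.endswith('.' + suffix):
            # the label just before the matching suffix is the site name
            return domain[:-len(suffix) - 1].split('.')[-1]
    return None

print second_level('test.example.site.org.uk')   # 'site'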

+3

The only possible way is with a list of all top-level domains (here meaning, for example, .com or .co.uk). Then you check the domain against that list. I see no other way, at least not without Internet access at runtime.

+2

Using the python tld package:

https://pypi.python.org/pypi/tld

$ pip install tld

from tld import get_tld
print get_tld("http://www.google.co.uk")
# 'google.co.uk'
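
A hedged follow-up (not part of the original answer): assuming get_tld strips subdomains the same way for the deeper examples from the question, the bare second-level label can be split off the result:

from tld import get_tld

# assumption: get_tld also reduces deeper names to the registered domain
registered = get_tld("http://test.example.site.org.uk")   # expected: 'site.org.uk'
print registered.split('.')[0]                            # 'site'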
+2

@Hugh Bothwell

In your example you are not handling exception domains such as parliament.uk; those are listed in the file with a "!" prefix (e.g. !parliament.uk).

I made some changes to your code and also made it look more like my PHP function that I used before.

Also added the ability to load data from a local file.
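
A minimal sketch (not the actual TLDExtractor code; see the repository linked at the end) of how the "!" exception and "*" wildcard rules could be handled, with the rules loaded from a local copy of the file:

def load_rules(path):
    # read a local copy of effective_tld_names.dat, skipping comments and blanks
    with open(path) as f:
        return set(ln.strip() for ln in f if ln.strip() and not ln.startswith('//'))

def get_suffix(domain, rules):
    chunks = domain.split('.')
    best = None
    for start in range(len(chunks) - 1, -1, -1):
        candidate = '.'.join(chunks[start:])
        wildcard = '.'.join(['*'] + chunks[start + 1:])
        if '!' + candidate in rules:
            break          # exception rule: candidate is NOT a public suffix
        if candidate in rules or wildcard in rules:
            best = candidate
    return best

# e.g. get_suffix('parliament.uk', rules) -> 'uk', so 'parliament' is the site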

I also tested it with some domains, such as:

  • niki.bg, niki.1.bg
  • parliament.uk
  • niki.at, niki.co.at
  • niki.us, niki.ny.us
  • niki.museum, niki.national.museum
  • www.niki.uk - due to the "*" wildcard in the Mozilla file, this one is reported as OK

Feel free to contact me @github, so I can add you as a collaborator.

The GitHub repository is here:

https://github.com/nmmmnu/TLDExtractor/blob/master/TLDExtractor.py

+1
