Python URL Splitting

I have a string like google.com in Python that I would like to split into two parts: google and .com. The problem arises when I have a URL like subdomain.google.com, which I would like to split into subdomain.google and .com.

How can I separate the rest of the URL from the TLD? Splitting on the last . in the URL does not work because of TLDs such as .co.uk. Note that the URL does not contain http:// or www.

2 answers

tldextract looks like exactly what you need. It handles the .co.uk problem.


To do this, you need a list of valid domain suffixes. The top-level domains (.com, .org, etc.) and country codes (.us, .fr, etc.) are easy to find. Try http://www.icann.org/en/resources/registries/tlds .

For the second level (.co.uk, .org.au) you may need to look up each country code to see which second-level domains it uses. Wikipedia is your friend.

Once you have the lists, take the last two parts of the name you have (google.com or co.uk) and check whether they are on your second-level list. If not, take the last part and check whether it is on the top-level list.
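The lookup described above can be sketched roughly as follows. The two sets are tiny illustrative samples, not real lists, and the function name split_by_suffix_lists is hypothetical; a real version would load the full lists from the sources mentioned earlier.

```python
# Illustrative samples only; real lists would be far longer.
SECOND_LEVEL = {"co.uk", "org.uk", "org.au", "com.au"}
TOP_LEVEL = {"com", "org", "net", "uk", "au", "us", "fr"}


def split_by_suffix_lists(host):
    """Split host into (rest, suffix) using the two lookup lists."""
    # Lowercase first so the lookups are case-insensitive.
    labels = host.lower().split(".")
    # First check the last two labels against the second-level list.
    if len(labels) > 2 and ".".join(labels[-2:]) in SECOND_LEVEL:
        cut = 2
    # Otherwise fall back to the last label and the top-level list.
    elif len(labels) > 1 and labels[-1] in TOP_LEVEL:
        cut = 1
    else:
        raise ValueError("no known suffix in %r" % host)
    return ".".join(labels[:-cut]), "." + ".".join(labels[-cut:])


print(split_by_suffix_lists("subdomain.google.com"))  # ('subdomain.google', '.com')
print(split_by_suffix_lists("google.co.uk"))          # ('google', '.co.uk')
```

Checking the two-label suffix before the one-label suffix matters: with the order reversed, google.co.uk would match .uk first and be split into google.co and .uk.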


Source: https://habr.com/ru/post/1415821/

