Recover URL using Python

I have a large file. Each line of the file is a URL entered by a person, so the URLs can have various problems, such as a missing http:// or a missing www.

Is there a Python module that can repair these URLs? I tried url_fix from werkzeug.urls, but that is not quite what I am looking for.

    www.example.com >> http://www.example.com/

Of course, no method can fix every possible error, but I am looking for something that handles the most common ones.

Do you have any tips?

EDIT: Regarding Peter Wood's comment: yes, assume the URLs should contain www. In my case, these are e-shop URLs.

+7
python url urlparse
1 answer

You can parse the URLs with urlparse, supplying a default scheme, and then recombine them:

    >>> from urlparse import urlparse, urlunparse
    >>> urlunparse(urlparse('www.example.com', scheme='http'))
    'http:///www.example.com'

As I mentioned in my comment, the lack of www is not necessarily a mistake.

If you really insist on prepending www, then:

    def fix_url(url):
        parsed = urlparse(url, scheme='http')
        if parsed.netloc:
            if not parsed.netloc.startswith('www.'):
                parsed = parsed._replace(netloc='www.' + parsed.netloc)
        elif not parsed.path.startswith('www.'):
            parsed = parsed._replace(path='www.' + parsed.path)
        return urlunparse(parsed)

    >>> fix_url('http://example.com')
    'http://www.example.com'
    >>> fix_url('example.com')
    'http:///www.example.com'
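For completeness, here is a sketch of the same idea on Python 3 (the function name and example URLs are just illustrations, not part of the original answer). To avoid the 'http:///' artefact, this version also moves a bare hostname out of .path into .netloc before recombining:

```python
from urllib.parse import urlparse, urlunparse

def fix_url(url):
    """Best-effort repair: add a default scheme and a leading 'www.'."""
    parsed = urlparse(url, scheme='http')
    if not parsed.netloc:
        # A scheme-less URL such as 'example.com/shop' is parsed entirely
        # into .path, so split the host off and move it into .netloc.
        host, _, rest = parsed.path.partition('/')
        parsed = parsed._replace(netloc=host, path='/' + rest if rest else '')
    if not parsed.netloc.startswith('www.'):
        parsed = parsed._replace(netloc='www.' + parsed.netloc)
    return urlunparse(parsed)

print(fix_url('example.com'))                   # http://www.example.com
print(fix_url('www.example.com'))               # http://www.example.com
print(fix_url('http://example.com/shop?id=1'))  # http://www.example.com/shop?id=1
```

As noted above, blindly prepending www. is only safe because you know your e-shop URLs are all supposed to have it.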
+3
