Canonize / normalize url?

I was looking for a library function to normalize the URL in Python, that is, remove the "./" or "../" parts in the path or add the default port or hide special characters, etc. The result should be a string that is unique to two URLs pointing to the same web page. For example, http://google.com and http://google.com:80/a/../ should return the same result.

I would prefer Python 3 and already look through the urllib module. It offers functions for separating URLs, but nothing allows them to be canonized. Java has a URI.normalize() function that does a similar thing (although it does not consider port 80 to be equal to any given port by default), but is there something like this python?

+9
url normalization
May 14 '12 at 14:02
source share
4 answers

Following a good start , I have compiled a method that is suitable for most cases commonly found on the Internet.

 def urlnorm(base, link=''): '''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.''' new = urlparse(urljoin(base, url).lower()) return urlunsplit(( new.scheme, (new.port == None) and (new.hostname + ":80") or new.netloc, new.path, new.query, '')) 
0
May 19 '12 at 13:29
source share

How about this:

 In [1]: from urllib.parse import urljoin In [2]: urljoin('http://example.com/a/b/c/../', '.') Out[2]: 'http://example.com/a/b/' 

Inspired by the answers to this question . It does not normalize ports, but it should just be a hack function that does.

+4
May 14 '12 at 16:34
source share

This is what I use, and it has worked so far. You can get urlnorm from pip.

Please note that I am sorting request parameters. I found this significant.

 from urlparse import urlsplit, urlunsplit, parse_qsl from urllib import urlencode import urlnorm def canonizeurl(url): split = urlsplit(urlnorm.norm(url)) path = split[2].split(' ')[0] while path.startswith('/..'): path = path[3:] while path.endswith('%20'): path = path[:-3] qs = urlencode(sorted(parse_qsl(split.query))) return urlunsplit((split.scheme, split.netloc, path, qs, '')) 
+4
Mar 26 '13 at 4:56
source share

The urltools module normalizes multiple slashes, components . and .. without spoiling the double slash http:// .

Once you run pip install urltools , the usage will be as follows:

 print urltools.normalize('http://domain.com:80/a////b/../c') >>> 'http://domain.com/a/c' 
+2
Jun 11 '16 at 17:01
source share



All Articles