Canonize / normalize url?

Question

Canonize / normalize url?

I was looking for a library function to normalize the URL in Python, that is, remove the "./" or "../" parts in the path or add the default port or hide special characters, etc. The result should be a string that is unique to two URLs pointing to the same web page. For example, http://google.com and http://google.com:80/a/../ should return the same result.

I would prefer Python 3 and already look through the urllib module. It offers functions for separating URLs, but nothing allows them to be canonized. Java has a URI.normalize() function that does a similar thing (although it does not consider port 80 to be equal to any given port by default), but is there something like this python?

+9

url python-3.x normalization

XZS May 14 '12 at 14:02

source share

4 answers

How about this:

 In [1]: from urllib.parse import urljoin In [2]: urljoin('http://example.com/a/b/c/../', '.') Out[2]: 'http://example.com/a/b/'

Inspired by the answers to this question . It does not normalize ports, but it should just be a hack function that does.

+4

Thomas K May 14 '12 at 16:34

source share

This is what I use, and it has worked so far. You can get urlnorm from pip.

Please note that I am sorting request parameters. I found this significant.

 from urlparse import urlsplit, urlunsplit, parse_qsl from urllib import urlencode import urlnorm def canonizeurl(url): split = urlsplit(urlnorm.norm(url)) path = split[2].split(' ')[0] while path.startswith('/..'): path = path[3:] while path.endswith('%20'): path = path[:-3] qs = urlencode(sorted(parse_qsl(split.query))) return urlunsplit((split.scheme, split.netloc, path, qs, ''))

+4

stuckintheshuck Mar 26 '13 at 4:56

source share

The urltools module normalizes multiple slashes, components . and .. without spoiling the double slash http:// .

Once you run pip install urltools , the usage will be as follows:

 print urltools.normalize('http://domain.com:80/a////b/../c') >>> 'http://domain.com/a/c'

+2

ccpizza Jun 11 '16 at 17:01

source share

XZS · Accepted Answer · 2012-05-19 13:29

Following a good start , I have compiled a method that is suitable for most cases commonly found on the Internet.

 def urlnorm(base, link=''): '''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.''' new = urlparse(urljoin(base, url).lower()) return urlunsplit(( new.scheme, (new.port == None) and (new.hostname + ":80") or new.netloc, new.path, new.query, ''))

Canonize / normalize url?

More articles: