urljoin won't work here, because it only resolves dot segments when the second argument is neither absolute (!?) nor empty. On top of that, it doesn't handle excessive ..s the way RFC 3986 requires (they should be removed; urljoin doesn't do that). posixpath.normpath can't be used either (much less os.path.normpath), since it collapses multiple consecutive slashes into one (for example, ///// becomes /), which is incorrect behavior for URLs.
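To make the normpath problem concrete, here is a quick check. The sample path is only an illustration; any path with repeated slashes shows the same effect.

```python
import posixpath

# A URL path in which the empty segments (the repeated slashes) are significant.
url_path = '/thing///wrong/../multiple-slashes-yeah/.'

# normpath resolves the dot segments, but it also folds every run of
# slashes into a single one and drops the trailing slash.
normalized = posixpath.normpath(url_path)
print(normalized)

# A run of slashes on its own collapses to the root.
only_slashes = posixpath.normpath('/////')
print(only_slashes)
```

Both results are fine for file system paths, but for a URL they name a different resource than the one that was asked for.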
The following short function resolves any URL path string correctly. It shouldn't be used with relative paths, however, since that would require additional decisions about its behavior (raise an error on excessive ..s? Remove a leading .? Leave them both?). Instead, join URLs before resolving if you know you might have to handle relative paths. Without further ado:
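def resolve_url_path(path):
    # Keep the slash attached to each segment so that empty segments
    # (consecutive slashes) survive the round trip.
    segments = path.split('/')
    segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
    resolved = []
    for segment in segments:
        if segment in ('../', '..'):
            # Never pop the first element (the leading '/' of an absolute path).
            if resolved[1:]:
                resolved.pop()
        elif segment not in ('./', '.'):
            resolved.append(segment)
    return ''.join(resolved)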
This correctly handles trailing dot segments (that is, ones without a trailing slash) and consecutive slashes. To resolve an entire URL, you can then use the following wrapper (or just inline the path resolution function into it).
try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:
    # Python 2
    from urlparse import urlsplit, urlunsplit

def resolve_url(url):
    parts = list(urlsplit(url))
    parts[2] = resolve_url_path(parts[2])
    return urlunsplit(parts)
Then you can call it like this:
>>> resolve_url('http://example.com/../thing///wrong/../multiple-slashes-yeah/.')
'http://example.com/thing///multiple-slashes-yeah/'
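A few more cases may help to see the edge-case behavior. This is only a sketch: both functions are repeated so the snippet runs on its own, and the sample paths and expected values are my own illustrations, worked out by hand from the dot-segment removal rules in RFC 3986.

```python
try:  # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:  # Python 2
    from urlparse import urlsplit, urlunsplit


def resolve_url_path(path):
    # Same logic as the function in the answer, repeated for self-containment.
    segments = path.split('/')
    segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
    resolved = []
    for segment in segments:
        if segment in ('../', '..'):
            if resolved[1:]:
                resolved.pop()
        elif segment not in ('./', '.'):
            resolved.append(segment)
    return ''.join(resolved)


def resolve_url(url):
    parts = list(urlsplit(url))
    parts[2] = resolve_url_path(parts[2])
    return urlunsplit(parts)


# (input path, expected result); the inputs are illustrative, not from a spec.
cases = [
    ('/a/b/..', '/a/'),        # trailing ".." without a slash
    ('/a/./b/.', '/a/b/'),     # "." segments, including a trailing one
    ('/a//b/../c', '/a//c'),   # the empty segment between slashes is kept
    ('/../../x', '/x'),        # excessive ".." segments are dropped
    ('', ''),                  # an empty path stays empty
]
for path, expected in cases:
    assert resolve_url_path(path) == expected, (path, expected)

# Only the path component is touched; query and fragment pass through as-is.
full = resolve_url('http://example.com/a/../b?q=../x#frag/..')
print(full)
```

Note that the last call leaves the ".." inside the query string and the fragment alone, since only the path component is passed to resolve_url_path.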
Proper URL resolution has more than a few pitfalls, it turns out!
obskyr, Nov 10 '16 at 20:06