Python: how to allow URLs containing ".."

I need to uniquely identify and store some URLs. The problem is that sometimes they contain ".." like http://somedomain.com/foo/bar/../../some/url , which is basically http://somedomain.com/some/url , if I'm not mistaken.

Is there a Python function or a sophisticated way to resolve these urls?

+6
python url
Nov 30 '10 at 18:41
source share
6 answers

Theres a simple solution using urlparse .urljoin:

 >>> import urlparse >>> urlparse.urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.') 'http://www.example.com/baz/bux/' 

However, if there is no trailing slash (the last component is a file, not a directory), the last component will be deleted.

This fix uses the urlparse function to fetch the path, and then uses the (posixpath version) os.path to normalize the components. Confirm the cryptic issue with trailing slashes , then append the URL together. The following is the doctest :

 import urlparse import posixpath def resolveComponents(url): """ >>> resolveComponents('http://www.example.com/foo/bar/../../baz/bux/') 'http://www.example.com/baz/bux/' >>> resolveComponents('http://www.example.com/some/path/../file.ext') 'http://www.example.com/some/file.ext' """ parsed = urlparse.urlparse(url) new_path = posixpath.normpath(parsed.path) if parsed.path.endswith('/'): # Compensate for issue1707768 new_path += '/' cleaned = parsed._replace(path=new_path) return cleaned.geturl() 
+10
Nov 30 '10 at 19:04
source share
— -

This is the file path. Take a look at os.path.normpath :

 >>> import os >>> os.path.normpath('/foo/bar/../../some/url') '/some/url' 

EDIT:

If it is on Windows, your input path will use a backslash instead of a slash. In this case, you still need os.path.normpath to get rid of the patterns .. (and // and /./ and everything else is redundant), then convert the backslash to a slash:

 def fix_path_for_URL(path): result = os.path.normpath(path) if os.sep == '\\': result = result.replace('\\', '/') return result 

EDIT 2:

If you want to normalize URLs, do this (before disabling the method, etc.) using urlparse , as shown in the answer to this question .

EDIT 3:

It seems that urljoin not normalizing the base path it gave:

 >>> import urlparse >>> urlparse.urljoin('http://somedomain.com/foo/bar/../../some/url', '') 'http://somedomain.com/foo/bar/../../some/url' 

normpath in itself does not quite abbreviate it:

 >>> import os >>> os.path.normpath('http://somedomain.com/foo/bar/../../some/url') 'http:/somedomain.com/some/url' 

Note that the initial double slash has been eaten.

So, we must join forces:

 def fix_URL(urlstring): parts = list(urlparse.urlparse(urlstring)) parts[2] = os.path.normpath(parts[2].replace('/', os.sep)).replace(os.sep, '/') return urlparse.urlunparse(parts) 

Using:

 >>> fix_URL('http://somedomain.com/foo/bar/../../some/url') 'http://somedomain.com/some/url' 
+3
Nov 30 2018-10-18
source share

I wanted to comment on the resolveComponents function in the top answer.

Please note that if your path is / , the code will add another, which may be problematic. So I changed the IF condition to:

 if parsed.path.endswith( '/' ) and parsed.path != '/': 
+2
Nov 01 2018-12-11T00:
source share

urljoin will not work because it only allows point segments if the second argument is not absolute (!?) or empty. In addition, it does not handle excessive .. according to RFC 3986 (they should be removed; urljoin do not do this). posixpath.normpath cannot be used (much less than os.path.normpath) , since it allows multiple slashes in a line to only one (for example, ///// becomes / )), which is an incorrect behavior for the URL.




The following short function correctly resolves any line of the URL path. However, it should not be used with relative paths , since then it will be necessary to make additional decisions about its behavior (resurrect the error if excessive .. s? Delete . In the beginning? Leave them both?) - instead, attach the URLs to the resolution if you know you can handle relative paths. Without further ado:

 def resolve_url_path(path): segments = path.split('/') segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]] resolved = [] for segment in segments: if segment in ('../', '..'): if resolved[1:]: resolved.pop() elif segment not in ('./', '.'): resolved.append(segment) return ''.join(resolved) 

This allows you to process trailing point segments (that is, without a trailing slash) and consecutive slashes. To resolve the entire URL, you can use the following shell (or just insert the path resolution function into it).

 try: # Python 3 from urllib.parse import urlsplit, urlunsplit except ImportError: # Python 2 from urlparse import urlsplit, urlunsplit def resolve_url(url): parts = list(urlsplit(url)) parts[2] = resolve_url_path(parts[2]) return urlunsplit(parts) 



Then you can call it like this:

 >>> resolve_url('http://example.com/../thing///wrong/../multiple-slashes-yeah/.') 'http://example.com/thing///multiple-slashes-yeah/' 

Proper URL resolution has more than a few pitfalls, it turns out!

+1
Nov 10 '16 at 20:06
source share

According to RFC 3986, this should happen as part of the relative resolution process. So the answer could be urlparse.urljoin(url, '') . But due to an error urlparse.urljoin does not delete dot segments when the second argument is an empty URL. You can use yurl , an alternative URL manipulation library. He does it right:

 >>> import yurl >>> print yurl.URL('http://somedomain.com/foo/bar/../../some/url') + yurl.URL() http://somedomain.com/some/url 
0
Dec 25 '13 at 15:48
source share
 import urlparse import posixpath parsed = list(urlparse.urlparse(url)) parsed[2] = posixpath.normpath(posixpath.join(parsed[2], rel_path)) proper_url = urlparse.urlunparse(parsed) 
0
Oct 17 '17 at 6:03
source share



All Articles