I am working on an application that needs to parse URLs (mainly HTTP URLs) found in HTML pages. I cannot control the input, and some of it is, as expected, a bit messy.
One problem I often run into is that urlparse is very strict (and maybe even buggy?) when it comes to parsing and joining URLs that have a double slash in the path, for example:
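import urlparse  # Python 2 module; in Python 3 these functions live in urllib.parse

testUrl = 'http://www.example.com//path?foo=bar'
# Join the URL with its own path in an attempt to drop the query string
urlparse.urljoin(testUrl, urlparse.urlparse(testUrl).path)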
Instead of the expected result http://www.example.com//path (or, even better, the same URL with a normalized single slash), I get http://path.
By the way, I run this code because it is the only way I have found to strip the query and fragment parts from a URL. There may be a better way to do this, but I could not find it.
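For illustration, one possible alternative is to split the URL, blank out the query and fragment, and put it back together, so the path is never reinterpreted as a relative reference. This is only a sketch written for Python 3, where these functions live in urllib.parse rather than urlparse; the helper name strip_query_and_fragment is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace returns a modified copy
    return urlunsplit(parts._replace(query='', fragment=''))


print(strip_query_and_fragment('http://www.example.com//path?foo=bar'))
# http://www.example.com//path
```

Because the path is never passed to urljoin, the leading double slash is kept as part of the path instead of being read as the start of a host name.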
Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regular expression?
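The regular-expression idea could look roughly like the following sketch (again Python 3 syntax; collapse_path_slashes is a hypothetical name). It assumes that collapsing repeated slashes does not change what the server returns, which is not guaranteed for every site.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url):
    """Collapse runs of '/' in the path component only (hypothetical helper)."""
    parts = urlsplit(url)
    # Only the path is rewritten, so the '//' after the scheme is left alone
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit(parts._replace(path=path))


print(collapse_path_slashes('http://www.example.com//path?foo=bar'))
# http://www.example.com/path?foo=bar
```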