Python parsing - normalizing double slash in paths

I am working on an application that needs to parse URLs (mainly HTTP URLs) on HTML pages. I cannot control the input, and some of them, as expected, are a bit messy.

One of the problems I often run into is that urlparse is very strict (and maybe even a buggy?) When it comes to parsing and combining URLs with double slashes in part of the path, for example:

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, 
                 urlparse.urlparse(testUrl).path)

Instead of the expected result http://www.example.com//path(or, even better, with a normalized single slash), I get http://path.

By the way, I run such code because it is the only way I found to remove part of the request / fragment from the URLs. There may be a better way to do this, but I could not find it.

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regular expression?

+5
source share
7 answers

If you want to get the URL without part of the request, I would skip the urlparse module and simply execute:

testUrl.rsplit('?')

The URL will have index 0 of the returned list and the request at index 1.

It is impossible to have two '?' in the url, so it should work for all urls.

+4
source

(//path) ,

http://tools.ietf.org/html/rfc3986.html#section-3.3

URI , ( "//" ).

, :

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com//path

, , :

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

newurl = ["" for i in range(6)] # could urlparse another address instead

# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]

# Rest are blank
for i in range(4, 6):
    newurl[i] = ''

print urlparse.urlunparse(newurl)
# http://www.example.com//path
+6

urlparse , :

URL- URL- ( , // ://), / URL-.

urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

, URL urlsplit() urlunsplit(), netloc.

, :

urlparse.urljoin(testUrl,
             urlparse.urlparse(testUrl).path.replace('//','/'))

= 'http://www.example.com/path'

+2

?

urlparse.urlparse(testUrl).path.replace('//', '/')
0

:

def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

URL-:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

, ..:)

0

This answer seemed to give the best results when I tried to fix double slashes in paths without touching the initial double slash in the http: // bit.

here is the code:

from urlparse import urljoin
from functools import reduce


def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")
0
source

I accepted my answers @yunhasnawa. here is the part:

import urllib2
from urlparse import urlparse, urlunparse

def sanitize_url(url):
    url_parsed = urlparse(url)  
    return urlunparse((url_parsed.scheme, url_parsed.netloc, avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
  parts = path.split('/')
  not_empties = [part for part in parts if part]
  return '/'.join(not_empties)


>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'
0
source

All Articles