If you consider any URLs with the same netloc to be the same, you can parse them with urllib.parse:
from urllib.parse import urlparse

u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"
print(urlparse(u).netloc)
Which gives you:
www.myurlnumber1.com
So, to get unique netlocs, you can do something like:
unique = {urlparse(u).netloc for u in urls}
If you want to keep the URL scheme:
urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]
unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)
This assumes that every URL has a scheme, and that you do not have both http and https for the same netloc while considering them the same.
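If those assumptions do not hold, one possible workaround is sketched below. The helper name host_key is hypothetical, and the "//" prefix trick relies on urlparse only filling in netloc when the URL contains "//". It ignores the scheme entirely, so http and https collapse to the same key.

```python
from urllib.parse import urlparse


def host_key(url):
    """Return a lower-cased netloc for url, tolerating a missing scheme."""
    # Without "//", urlparse puts the host into path and leaves netloc empty,
    # so prepend "//" to bare URLs such as "www.example.com/foo".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlparse(url).netloc.lower()


urls = [
    "http://www.myurlnumber1.com/foo",
    "https://www.myurlnumber1.com/bar",
    "www.myurlnumber1.com/baz",
]
unique_hosts = {host_key(u) for u in urls}
print(unique_hosts)  # {'www.myurlnumber1.com'}
```

This is only a sketch: it does not strip ports or credentials from the netloc, so adjust the key if those matter for your data.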
If you also want to add a path:
unique = {(u.netloc, u.path) for u in map(urlparse, urls)}
The table of attributes is listed in the docs:
Attribute  Index  Value                               Value if not present
scheme     0      URL scheme specifier                scheme parameter
netloc     1      Network location part               empty string
path       2      Hierarchical path                   empty string
params     3      Parameters for last path element    empty string
query      4      Query component                     empty string
fragment   5      Fragment identifier                 empty string
username          User name                           None
password          Password                            None
hostname          Host name (lower case)              None
port              Port number as integer, if present  None
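To make the table concrete, here is a small sketch that parses one made-up URL and reads each attribute. The example.com address, credentials and port are assumptions chosen only to populate every field.

```python
from urllib.parse import urlparse

# Hypothetical URL, picked so that every attribute in the table is present.
u = urlparse("http://User:Pass@WWW.Example.com:8080/path;params?query=1#frag")

print(u.scheme)    # http
print(u.netloc)    # User:Pass@WWW.Example.com:8080
print(u.path)      # /path
print(u.params)    # params
print(u.query)     # query=1
print(u.fragment)  # frag
print(u.username)  # User
print(u.password)  # Pass
print(u.hostname)  # www.example.com
print(u.port)      # 8080

# The first six attributes are also reachable by index.
print(u[1] == u.netloc)  # True
```

Note that netloc keeps the credentials, the port and the original casing, while hostname is lower-cased and stripped of both, so hostname may be the better key if those differences should not count.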
You just need to use whichever pieces you consider to make a URL unique.
In [1]: from urllib.parse import urlparse

In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]

In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}

In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
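If you would rather keep the original URLs than the joined keys, something along these lines might work. The function name dedupe_urls is hypothetical, and it uses the same netloc + path key as the session above, keeping the first URL seen for each key.

```python
from urllib.parse import urlparse


def dedupe_urls(urls):
    """Keep the first URL for each netloc + path key, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        u = urlparse(url)
        # For scheme-less URLs netloc is empty and the host ends up in path,
        # so the concatenation still yields the same key.
        key = u.netloc + u.path
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


urls = [
    "http://www.url.com/foo-bar",
    "http://www.url.com/foo-bar?t=baz",
    "www.url.com/baz-qux",
    "www.url.com/foo-bar?t=baz",
]
deduped = dedupe_urls(urls)
print(deduped)  # ['http://www.url.com/foo-bar', 'www.url.com/baz-qux']
```

Swap the key for whatever combination of attributes you consider unique, for example adding u.query if the query string matters.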