Checking whether a substring of an item is in a list

I have a list of URLs (unicode) with many repetitions. For example, the URLs http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.

Therefore, I need to weed out all of these duplicates.

My first idea was to check whether a substring of each element appears in the list, for example:

    for url in list:
        if url[:30] not in list:
            print(url)

However, this checks whether the exact string url[:30] is an element of the list and, obviously, prints every URL, since no element exactly matches url[:30].

Is there an easy way to solve this problem?

EDIT:

Often the host and path in the URL remain unchanged, but the parameters are different. For my purposes, a URL with the same host and path but different parameters is still the same URL and counts as a duplicate.

2 answers

If you consider all URLs with the same netloc to be duplicates, you can parse them with urllib.parse:

    from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

    u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"
    print(urlparse(u).netloc)

Which will give you:

 www.myurlnumber1.com 

So, to get unique netlocs, you can do something like:

 unique = {urlparse(u).netloc for u in urls} 

If you want to keep the URL scheme:

    urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]
    unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
    print(unique)

This assumes that every URL has a scheme, and that you do not have both http and https versions of the same netloc that you would want to treat as the same URL.

If you also want to add a path:

    unique = {(u.netloc, u.path) for u in map(urlparse, urls)}

The table of attributes is given in the documentation:

    Attribute  Index  Value                                 Value if not present
    scheme     0      URL scheme specifier                  scheme parameter
    netloc     1      Network location part                 empty string
    path       2      Hierarchical path                     empty string
    params     3      Parameters for last path element      empty string
    query      4      Query component                       empty string
    fragment   5      Fragment identifier                   empty string
    username          User name                             None
    password          Password                              None
    hostname          Host name (lower case)                None
    port              Port number as integer, if present    None

You just need to use whichever parts you consider to make a URL unique.

    In [1]: from urllib.parse import urlparse

    In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]

    In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}

    In [4]: print(unique)
    {'www.url.com/baz-qux', 'www.url.com/foo-bar'}
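If you need to keep one full URL per duplicate group rather than just the keys, a minimal sketch along the same lines could look like this. The helper name dedupe_urls is hypothetical, and it assumes that every URL includes a scheme and that the first URL seen for each (netloc, path) pair is the one worth keeping:

```python
from urllib.parse import urlparse


def dedupe_urls(urls):
    """Keep the first URL seen for each (netloc, path) pair, in input order.

    Hypothetical helper: assumes every URL carries a scheme, so that
    urlparse fills in netloc instead of leaving the host in path.
    """
    seen = set()
    result = []
    for url in urls:
        parts = urlparse(url)
        key = (parts.netloc, parts.path)  # query and fragment are ignored
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


urls = [
    "http://www.url.com/foo-bar",
    "http://www.url.com/foo-bar?t=baz",
    "http://www.url.com/baz-qux",
    "http://www.url.com/baz-qux?t=1",
]
print(dedupe_urls(urls))
```

Using a list alongside the set preserves the original order, which a plain set comprehension does not.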

You could try adding another loop, if you are OK with that. Something like:

    for url in list:
        for i in range(len(list)):
            if url[:30] not in list[i]:
                print(url)

This compares the prefix of each URL against every element of the list to check for a match. It is just an example; I'm sure you can make it more robust.
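One way to make the prefix idea more robust is to remember the prefixes already seen in a set instead of comparing every pair. This is only a sketch under stated assumptions: the function name dedupe_by_prefix is hypothetical, and a fixed prefix length only works when the part you care about fits exactly within that many characters.

```python
def dedupe_by_prefix(urls, length=30):
    """Keep the first URL seen for each distinct prefix of the given length.

    Hypothetical helper: the default of 30 comes from the question; pick a
    length that actually covers the part of the URL you want to compare.
    """
    seen = set()
    unique = []
    for url in urls:
        prefix = url[:length]
        if prefix not in seen:
            seen.add(prefix)
            unique.append(url)
    return unique


urls = [
    "http://www.myurlnumber1.com",
    "http://www.myurlnumber1.com/foo",
    "http://www.myurlnumber2.com",
]
# The host part of these sample URLs is 27 characters long.
print(dedupe_by_prefix(urls, length=27))
```

With the default length of 30, the first two sample URLs would not be treated as duplicates, because their first 30 characters differ; that fragility is why parsing the URL is usually the safer choice.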
