Canonical URL in Python?

Are there any tools for comparing URLs in Python?

For example, if I have http://google.com and google.com/, I would like to know that they are likely to be the same site.

If I were constructing the rule manually, I might lowercase the URL, strip off the http:// prefix, and drop anything after the last alphanumeric character. But I can see failure cases for this, as I'm sure you can as well.

Is there a library that does this? How do you do this?

python fuzzy-comparison
3 answers

Off the top of my head:

 def canonical_url(u):
     u = u.lower()
     if u.startswith("http://"):
         u = u[7:]
     if u.startswith("www."):
         u = u[4:]
     if u.endswith("/"):
         u = u[:-1]
     return u

 def same_urls(u1, u2):
     return canonical_url(u1) == canonical_url(u2)

Obviously, there's plenty of room for more fiddling here. Regexes might be better than startswith and endswith, but you get the idea.
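One way to fiddle further is to let the standard library's urllib.parse split the URL instead of slicing strings by hand. This is my own sketch, not part of the answer above; the function names canonicalize and same_site are made up for illustration:

```python
from urllib.parse import urlparse

def canonicalize(u):
    # Prepend "//" so urlparse puts a bare hostname in netloc, not path.
    if "//" not in u:
        u = "//" + u
    parts = urlparse(u, scheme="http")
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop trailing slashes from the path.
    return host + parts.path.rstrip("/")

def same_site(u1, u2):
    return canonicalize(u1) == canonicalize(u2)

print(same_site("http://google.com", "google.com/"))  # True
```

This still treats www.google.com and google.com as the same site, which may or may not be what you want.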


You could look up the names using DNS and see whether they point to the same IP. Stripping off the extra characters requires some minor string processing.

 from socket import gethostbyname_ex

 urls = ['http://google.com', 'google.com/', 'www.google.com/', 'news.google.com']
 data = []
 for originalName in urls:
     print('url:', originalName)
     name = originalName.strip()
     name = name.replace('http://', '')
     name = name.replace('http:', '')
     # Cut off any path component after the hostname.
     if name.find('/') > 0:
         name = name[:name.find('/')]
     if name.find('\\') > 0:
         name = name[:name.find('\\')]
     print('dns lookup:', name)
     if name:
         try:
             result = gethostbyname_ex(name)
         except OSError:
             continue  # Unable to resolve
         for ip in result[2]:
             print('ip:', ip)
             data.append((ip, originalName))
 print(data)

result:

 url: http://google.com
 dns lookup: google.com
 ip: 66.102.11.104
 url: google.com/
 dns lookup: google.com
 ip: 66.102.11.104
 url: www.google.com/
 dns lookup: www.google.com
 ip: 66.102.11.104
 url: news.google.com
 dns lookup: news.google.com
 ip: 66.102.11.104
 [('66.102.11.104', 'http://google.com'), ('66.102.11.104', 'google.com/'), ('66.102.11.104', 'www.google.com/'), ('66.102.11.104', 'news.google.com')]

This isn't "fuzzy"; it just computes the "distance" between two strings:

http://pypi.python.org/pypi/python-Levenshtein/

I would strip all the parts that are semantically significant to URL parsing (protocol, slashes, etc.), normalize to lowercase, then compute the Levenshtein distance, and from there decide how much difference is an acceptable threshold.

Just an idea.

