Fuzzy string match in large text in Python (url)

I have a list of company names, and I have a list of URLs to pages that mention company names.

The ultimate goal is to look at each URL and find out how many of the companies on my list appear on the page at that URL.

Example URL: http://www.dmx.com/about/our-clients

Each page will be structured differently, so I don't have a good way to run a regular expression search and extract a separate string for each company name.

I would like to write a for loop that searches for each company from my list in the entire contents of the page. But it seems that Levenshtein distance works better for comparing two short strings than for comparing a short string against a large body of text.

Where should a newbie like me look?

+4
2 answers

It doesn't seem to me that you need any kind of "fuzzy" match. I assume that when you say "URL", you mean "the web page at the address that the URL points to." Just use Python's built-in substring test, the in operator:

    >>> import urllib2
    >>> webpage = urllib2.urlopen('http://www.dmx.com/about/our-clients')
    >>> webpage_text = webpage.read()
    >>> webpage.close()
    >>> for name in ['Caribou Coffee', 'Express', 'Sears']:
    ...     if name in webpage_text:
    ...         print name, "found!"
    ...
    Caribou Coffee found!
    Express found!
    >>>

If you are concerned about inconsistent capitalization, just convert everything to uppercase:

    >>> webpage_text = webpage_text.upper()
    >>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
    ...     if name in webpage_text:
    ...         print name, 'found!'
    ...
    CARIBOU COFFEE found!
    EXPRESS found!
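For readers on Python 3, where urllib2 no longer exists and print is a function, the same substring approach might look roughly like the sketch below. The helper names fetch_text and find_names, the UTF-8 default, and the sample page text are assumptions made for illustration; they are not part of the answer above.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download a page and decode it to str (the encoding is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def find_names(names, text):
    """Return the names that occur in text, ignoring case."""
    haystack = text.casefold()
    return [name for name in names if name.casefold() in haystack]


# Hypothetical page text, used here so the example runs without network access.
page_text = "Our clients include Caribou Coffee, EXPRESS and many more."
print(find_names(["Caribou Coffee", "Express", "Sears"], page_text))
```

With a real page, page_text would come from fetch_text(url) instead of the hard-coded string.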
+5

I would add to senderle's answer that it might make sense to normalize your names somehow (for example, remove all special characters), and then apply the same normalization to both webpage_text and your list of strings.

    def normalize_str(some_str):
        some_str = some_str.lower()
        for c in """-?'"/{}[]()&!,.`""":
            some_str = some_str.replace(c, "")
        return some_str
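As a rough illustration of how that normalization could be applied to both sides before the substring test, here is a small Python 3 sketch. The client names and the page text are invented for the example; only normalize_str comes from this answer.

```python
def normalize_str(some_str):
    """Lowercase the string and strip a fixed set of punctuation characters."""
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c, "")
    return some_str


# Hypothetical inputs: the page spells the names without apostrophes.
client_names = ["Macy's", "Dunkin' Donuts", "Sears"]
webpage_text = "Our clients: MACYS, Dunkin Donuts, and more."

# Normalize the page once, then test each normalized name against it.
normalized_text = normalize_str(webpage_text)
found = [name for name in client_names if normalize_str(name) in normalized_text]
print(found)
```

Note that this function leaves whitespace alone, so a name written with hyphens on one side and spaces on the other would still fail to match.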

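If this is not enough, you can turn to difflib and do something like:

    import difflib

    # get_close_matches expects a list of candidate strings, so compare each
    # name against the lines of the page rather than the text as a whole.
    candidates = webpage_text.splitlines()
    for client_name in normalized_client_names:
        closest_client = difflib.get_close_matches(client_name, candidates, 1, 0.8)
        if len(closest_client) > 0:
            print client_name, "found as", closest_client[0]

The cutoff of 0.8 that I picked arbitrarily (for the Ratcliff/Obershelp similarity ratio) may be too lenient or too strict; experiment with it a little.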
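One way to compare a short name against a large text with difflib is to slide a window of the same number of words across the text and keep the best-scoring window. The Python 3 sketch below assumes the page has already been reduced to plain text; the function name best_fuzzy_match and the sample sentence are made up for illustration.

```python
import difflib


def best_fuzzy_match(needle, text):
    """Return (ratio, window) for the run of words in text most similar to needle."""
    words = text.split()
    size = len(needle.split())
    best = (0.0, "")
    # Compare the needle with every run of `size` consecutive words.
    for i in range(max(len(words) - size + 1, 1)):
        window = " ".join(words[i:i + size])
        ratio = difflib.SequenceMatcher(None, needle.lower(), window.lower()).ratio()
        if ratio > best[0]:
            best = (ratio, window)
    return best


# Hypothetical page text containing a misspelled company name.
text = "Our clients include Cariboo Coffee, Express and Sears Holdings."
ratio, window = best_fuzzy_match("Caribou Coffee", text)
if ratio >= 0.8:
    print("Caribou Coffee found as", window)
```

This does one SequenceMatcher comparison per word position for each name, so it may be slow on long pages with many names; trying the plain substring test first and falling back to this only on a miss would likely be cheaper.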

+3
