We are trying to split the domain name ( s ) into any number of words (not only 2) from the set of known words ( words ). Recursion ftw!
def substrings_in_set(s, words): if s in words: yield [s] for i in range(1, len(s)): if s[:i] not in words: continue for rest in substrings_in_set(s[i:], words): yield [s[:i]] + rest
This iterator function first returns the string with which it is called if it is in words . Then it breaks the string into two parts in every possible way. If the first part is not in words , she tries the next split. If so, the first part is added to all the results of calling yourself in the second part (which may be nothing, for example, in ["example", "cart", ...])
Then we create an English dictionary:
# Assuming Linux. Word list may also be at /usr/dict/words. # If not on Linux, grab yourself an enlish word list and insert here: words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines()) # The above english dictionary for some reason lists all single letters as words. # Remove all except "i" and "u" (remember a string is an iterable, which means # that set("abc") == set(["a", "b", "c"])). words -= set("bcdefghjklmnopqrstvwxyz") # If there are more words we don't like, we remove them like this: words -= set(("ex", "rs", "ra", "frobnicate")) # We may also add words that we do want to recognize. Now the domain name # slartibartfast4ever.co.uk will be properly counted, for instance. words |= set(("4", "2", "slartibartfast"))
Now we can put it all together:
count = {} no_match = [] domains = ["examplecartrading.com", "examplepensions.co.uk", "exampledeals.org", "examplesummeroffers.com"] # Assume domains is the list of domain names ["examplecartrading.com", ...] for domain in domains: # Extract the part in front of the first ".", and make it lower case name = domain.partition(".")[0].lower() found = set() for split in substrings_in_set(name, words): found |= set(split) for word in found: count[word] = count.get(word, 0) + 1 if not found: no_match.append(name) print count print "No match found for:", no_match
Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}
Using set to store an English dictionary allows you to quickly verify membership. -= removes elements from the set, |= adds to it.
Using the all function together with a generator expression improves efficiency, since all returns the first False .
Some substrings can be a valid word, either integer or separated, for example, "example" / "ex" + "enough". In some cases, we can solve the problem by eliminating unnecessary words such as "ex" in the above code example. For others, for example, “pensions” / “pens” + “ions”, this can be inevitable, and when this happens, we need to prevent all other words in the line from being counted several times (once for “pensions” and one times for "pens" + "ions"). We do this by keeping track of the words found for each domain name in the set — set ignoring duplicates — and then counting the words when they are all found.
EDIT: Restructured and added a lot of comments. Forced lowercase strings to avoid omissions due to capitalization. A list has also been added to track domain names where there was no word combination.
INSUFFICIENCY: The substring function has been changed to scale better. The old version has become ridiculously slow for domain names longer than 16 characters or so. Using only the four domain names above, I improved my own run time from 3.6 seconds to 0.2 seconds!