Breaking a line into separate words (Python)

Question

I have a large list of domain names (about six thousand), and I would like to see which words occur most often, to get a rough overview of our portfolio.

The problem is that the list is formatted as domain names, for example:

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

+5996

Simple word counting produces garbage. So I think the easiest way to do this is to insert spaces between whole words and then run a word count.

For my own sanity, I would prefer to script this.

I know (very) little Python 2.7, but I am open to any recommendations on how to approach this; sample code would really help. I was told that a simple string trie data structure would be the easiest way to achieve this, but I don't know how to implement one in Python.

Thanks!

Chris

+4

python string trie

Christopher long Aug 1 '11 at 10:35

3 answers

    from collections import defaultdict

    # A set gives fast membership tests (the path assumes a Unix-like system)
    with open('/usr/share/dict/words') as f:
        words = set(w.strip() for w in f)

    def guess_split(word):
        # Try every way of cutting word into exactly two dictionary words
        result = []
        for n in range(1, len(word)):
            if word[:n] in words and word[n:] in words:
                result = [word[:n], word[n:]]
        return result

    word_counts = defaultdict(int)
    with open('blah.txt') as f:
        for line in f:
            for word in line.strip().split('.'):
                if len(word) > 3:  # junks the com, org, stuff
                    for x in guess_split(word):
                        word_counts[x] += 1

    for word, count in word_counts.items():
        print('{word}: {count}'.format(word=word, count=count))

This is a brute-force approach that only tries to split each domain into 2 English words. If a domain cannot be split into 2 English words, it is skipped. It should be easy to extend it to try more splits, but it probably won't scale well with the number of splits unless you are smart about it. Fortunately, I think you only need 3 or 4 splits at most.

Output:

    deals: 1
    example: 2
    pensions: 1
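
The extension to more than two words mentioned above could look roughly like the following. This is only a sketch under assumptions: the function name guess_splits, the max_parts parameter and the tiny known_words set are made up for illustration and are not part of the answer's code.

```python
def guess_splits(word, words, max_parts=4):
    """Yield every split of word into 1..max_parts words found in words."""
    if max_parts < 1:
        return
    if word in words:
        yield [word]
    if max_parts == 1:
        return
    for n in range(1, len(word)):
        head = word[:n]
        if head in words:
            # Only recurse when the first piece is a known word
            for rest in guess_splits(word[n:], words, max_parts - 1):
                yield [head] + rest

# Hypothetical miniature dictionary, just for the demonstration
known_words = {"example", "car", "trading", "deals"}

print(list(guess_splits("examplecartrading", known_words)))
# [['example', 'car', 'trading']]
print(list(guess_splits("examplecartrading", known_words, max_parts=2)))
# []
```

Capping the number of parts keeps the search bounded, which matches the guess that 3 or 4 splits are enough for typical domain names.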

+1

wim Aug 1 '11 at 10:46

Assuming you have only a few thousand standard domains, you should be able to do this all in memory.

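    from collections import Counter

    def all_sub_strings(s):
        # Helper left undefined in the original: every contiguous substring of s
        return (s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1))

    # domainfile (a path) and DictionaryFileOfEnglishLanguage (an open file)
    # are placeholders that you have to supply yourself.
    domains = open(domainfile)
    # strip() removes the trailing newline, otherwise nothing would ever match
    dictionary = set(w.strip().lower() for w in DictionaryFileOfEnglishLanguage)

    found = []
    for domain in domains:
        for substring in all_sub_strings(domain.strip().lower()):
            if substring in dictionary:
                found.append(substring)

    c = Counter(found)  # this is what you want
    print(c)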

+1

robert king Aug 1 '11 at 12:07

Lauritz V. Thaulow · Accepted Answer · 2011-08-01T11:36:44+0000

We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!

    def substrings_in_set(s, words):
        if s in words:
            yield [s]
        for i in range(1, len(s)):
            if s[:i] not in words:
                continue
            for rest in substrings_in_set(s[i:], words):
                yield [s[:i]] + rest

This generator function first yields the string it was called with, if that string is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to every result of calling itself on the second part (which may be none, as with ["example", "cart", ...]).
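
To make that behaviour concrete, here is a small stand-alone run of the same function. The miniature known_words set is an assumption chosen for the demonstration; a real run would use a full word list.

```python
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

# Hypothetical miniature dictionary, just for the demonstration
known_words = {"example", "car", "cart", "trading", "pens", "ions", "pensions"}

# "cart" is a known word, but "rading" cannot be split, so that branch yields nothing
print(list(substrings_in_set("examplecartrading", known_words)))
# [['example', 'car', 'trading']]

# Both the whole word and its two halves are valid here
print(list(substrings_in_set("examplepensions", known_words)))
# [['example', 'pensions'], ['example', 'pens', 'ions']]
```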

Then we create an English dictionary:

    # Assuming Linux. Word list may also be at /usr/dict/words.
    # If not on Linux, grab yourself an English word list and insert here:
    words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

    # The above English dictionary for some reason lists all single letters as words.
    # Remove all except "a", "i" and "u" (remember a string is an iterable, which
    # means that set("abc") == set(["a", "b", "c"])).
    words -= set("bcdefghjklmnopqrstvwxyz")

    # If there are more words we don't like, we remove them like this:
    words -= set(("ex", "rs", "ra", "frobnicate"))

    # We may also add words that we do want to recognize. Now the domain name
    # slartibartfast4ever.co.uk will be properly counted, for instance.
    words |= set(("4", "2", "slartibartfast"))

Now we can put it all together:

    count = {}
    no_match = []
    domains = ["examplecartrading.com", "examplepensions.co.uk",
               "exampledeals.org", "examplesummeroffers.com"]

    # Assume domains is the list of domain names ["examplecartrading.com", ...]
    for domain in domains:
        # Extract the part in front of the first ".", and make it lower case
        name = domain.partition(".")[0].lower()
        found = set()
        for split in substrings_in_set(name, words):
            found |= set(split)
        for word in found:
            count[word] = count.get(word, 0) + 1
        if not found:
            no_match.append(name)

    # Single-argument print() calls behave the same on Python 2.7 and Python 3
    print(count)
    print("No match found for: %s" % no_match)

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

Storing the English dictionary in a set makes membership tests fast. -= removes elements from the set, and |= adds elements to it.
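
As a quick illustration of those two operators, here is a minimal sketch on a made-up set (the name demo_words and its contents are hypothetical):

```python
demo_words = {"a", "b", "car", "ex", "example"}

# -= removes every listed element that is present; absent ones are ignored
demo_words -= set("bcd")      # drops "b"; "c" and "d" were never in the set
demo_words -= {"ex"}

# |= adds elements; anything already present stays a single entry
demo_words |= {"4", "slartibartfast", "car"}

print("car" in demo_words)    # True
print(sorted(demo_words))
# ['4', 'a', 'car', 'example', 'slartibartfast']
```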

Using the all function together with a generator expression improves efficiency, since all stops at the first False value it encounters.
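
The short-circuit behaviour of all can be seen in isolation with a small sketch. The names is_word, parts and checked are hypothetical and do not come from the code above:

```python
def is_word(part, words, checked):
    checked.append(part)  # record every part that actually gets tested
    return part in words

known_words = {"example", "deals"}
parts = ["example", "xyz", "deals"]
checked = []

# The generator expression is consumed lazily, so all() stops at "xyz"
result = all(is_word(p, known_words, checked) for p in parts)

print(result)   # False
print(checked)  # ['example', 'xyz']
```

With a list comprehension instead of a generator expression, all three parts would be tested before all even starts.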

Some substrings may be valid words both as a whole and when split, for example "example" / "ex" + "ample". In some cases we can solve the problem by excluding unwanted words, such as "ex" in the code example above. In others, such as "pensions" / "pens" + "ions", it may be unavoidable, and when that happens we need to prevent the other words in the name from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by collecting the words found for each domain name in a set (sets ignore duplicates) and counting the words only once all of them have been found.
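
The effect of collecting words in a per-domain set can be sketched as follows. The splits list is written out by hand here and stands in for what the splitter might produce for examplepensions; Counter is used only to keep the sketch short.

```python
from collections import Counter

# Hand-written stand-in for the two splits of "examplepensions"
splits = [["example", "pensions"], ["example", "pens", "ions"]]

# Naive counting: "example" is counted once per split
naive = Counter(word for split in splits for word in split)

# Per-domain set: each word is counted at most once for this domain
found = set()
for split in splits:
    found |= set(split)
per_domain = Counter(found)

print(naive["example"])       # 2
print(per_domain["example"])  # 1
print(sorted(found))          # ['example', 'ions', 'pens', 'pensions']
```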

EDIT: Restructured and added a lot of comments. Forced strings to lower case to avoid misses due to capitalization. Also added a list to keep track of domain names for which no combination of words matched.

EDIT 2: Changed the substring function so that it scales better. The old version became ridiculously slow for domain names longer than 16 characters or so. Using only the four domain names above, I have improved my own run time from 3.6 seconds to 0.2 seconds!
