Is there a better way to find the most common word in a list (Python only)?

Given the trivial implementation below, I'm looking for a significantly faster way to find the most common word in a Python list. In a Python interview I received feedback that this implementation is so inefficient that it is basically a failure. I later tried many algorithms I could find, and only some heap-based solutions were slightly faster, though not by a wide margin (when scaling to tens of millions of items the heap-based approach was about 30% faster; at trivial lengths such as thousands it was almost the same according to timeit).

    def stupid(words):
        freqs = {}
        for w in words:
            freqs[w] = freqs.get(w, 0) + 1
        return max(freqs, key=freqs.get)

Since this is a simple problem and I have some experience (although I am no algorithms guru or competitive programmer), I was surprised.

Of course, I would like to improve my skills and learn whether there is a much better way to solve the problem, so your input will be appreciated.

Clarification regarding duplicate status: my goal is to find out whether there actually is an asymptotically better solution; other similar questions have accepted answers that are not meaningfully better. If that is not enough to make this question unique, feel free to close it.

Update

Thanks to everyone for the input. Regarding the interview situation, I am left with the impression that a hand-rolled search algorithm was expected (one that may be somewhat more efficient), and/or that the reviewer evaluated the code from the point of view of another language with different constant factors. Of course, everyone is entitled to their own standards.

It was important for me to find out whether I was completely clueless (I got the impression that I was not) or had merely written less-than-optimal code. It is still possible that an even better algorithm exists, but if it has stayed hidden from the community here for several days, I am fine with that.

I am accepting the answer that seems most correct, although more than one person shared useful feedback.

Minor update

Using defaultdict seems to have a noticeable advantage over using the get method, even when get is statically aliased (e.g. bound to a local name beforehand).
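For reference, here is a minimal sketch of the defaultdict variant I am referring to (the function name is mine):

    from collections import defaultdict

    def with_defaultdict(words):
        # missing keys default to 0, so there is no .get() call per word
        freqs = defaultdict(int)
        for w in words:
            freqs[w] += 1
        return max(freqs, key=freqs.get)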

+6
6 answers

This sounds like a bad interview question, probably a case of the interviewer expecting a particular answer. It definitely sounds as if he/she did not clearly explain what was being asked.

Your solution is O(n) (where n = len(words)), and using a heap does not change that.

Solutions don't get any faster than that.

+2
    from collections import Counter

    word_counter = Counter(words)

word_counter is a dictionary with words as keys and frequencies as values, and it also has a most_common() method.
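For example, to get just the top word and its count (a small usage sketch; the sample list is made up):

    from collections import Counter

    words = ['spam', 'egg', 'spam', 'ham', 'spam']
    word_counter = Counter(words)
    top_word, top_count = word_counter.most_common(1)[0]
    # top_word == 'spam', top_count == 3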

+1

Function calls and global namespace lookups are expensive.

Your stupid function makes two function calls for every item in the word list. The second one, in your max call, is entirely avoidable: iterating over the keys of the dict and then looking up each key's value with dict.get is glaring inefficiency when you can iterate over the key-value pairs instead.

    def stupid(words):
        freqs = {}
        for w in words:
            freqs[w] = freqs.get(w, 0) + 1
        return max(freqs, key=freqs.get)

    def most_frequent(words):
        ## Build the frequency dict
        freqs = {}
        for w in words:
            if w in freqs:
                freqs[w] += 1
            else:
                freqs[w] = 1
        ## Search the frequency dict
        m_k = None
        m_v = 0
        for k, v in freqs.iteritems():
            if v > m_v:
                m_k, m_v = k, v
        return m_k, m_v

Using user1952500's single-pass suggestion, how does this fare on your large samples?

    def faster(words):
        freq = {}
        m_k = None
        m_v = 0
        for w in words:
            if w in freq:
                v = freq[w] + 1
            else:
                v = 1
            freq[w] = v
            if v > m_v:
                m_k = w
                m_v = v
        return m_k, m_v

This has the slight advantage of being stable when there are several equally common values.
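For example (a small illustration of the tie behaviour; the input list is made up):

    # 'b' is the first word to reach the winning count of 2,
    # so faster() reports it deterministically despite the tie with 'a'
    print faster(['a', 'b', 'b', 'a'])   # ('b', 2)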


A comparison of all the suggestions, using nltk.book to create a sample:

    def word_frequency_version1(words):
        """Petar initial"""
        freqs = {}
        for w in words:
            freqs[w] = freqs.get(w, 0) + 1
        return max(freqs, key=freqs.get)

    def word_frequency_version2(words):
        """Matt initial"""
        ## Build the frequency dict
        freqs = {}
        for w in words:
            if w in freqs:
                freqs[w] += 1
            else:
                freqs[w] = 1
        ## Search the frequency dict
        m_k = None
        m_v = 0
        for k, v in freqs.iteritems():
            if v > m_v:
                m_k, m_v = k, v
        return m_k, m_v

    def word_frequency_version3(words):
        """Noting max as we go"""
        freq = {}
        m_k = None
        m_v = 0
        for w in words:
            if w in freq:
                v = freq[w] + 1
            else:
                v = 1
            freq[w] = v
            if v > m_v:
                m_k = w
                m_v = v
        return m_k, m_v

    from collections import Counter

    def word_frequency_version4(words):
        """Built-in Counter"""
        c = Counter(words)
        return c.most_common()[0]

    from multiprocessing import Pool

    def chunked(seq, count):
        v = len(seq) / count
        for i in range(count):
            yield seq[i*v:v+i*v]

    def frequency_map(words):
        freq = {}
        for w in words:
            if w in freq:
                freq[w] += 1
            else:
                freq[w] = 1
        return freq

    def frequency_reduce(results):
        freq = {}
        for result in results:
            for k, v in result.iteritems():
                if k in freq:
                    freq[k] += v
                else:
                    freq[k] = v
        m_k = None
        m_v = None
        for k, v in freq.iteritems():
            if v > m_v:
                m_k = k
                m_v = v
        return m_k, m_v

    # def word_frequency_version5(words, chunks=5, pool_size=5):
    #     pool = Pool(processes=pool_size)
    #     result = frequency_reduce(pool.map(frequency_map, chunked(words, chunks)))
    #     pool.close()
    #     return result

    def word_frequency_version5(words, chunks=5, pool=Pool(processes=5)):
        """multiprocessing Matt initial suggestion"""
        return frequency_reduce(pool.map(frequency_map, chunked(words, chunks)))

    def word_frequency_version6(words):
        """Petar one-liner"""
        return max(set(words), key=words.count)

    import timeit
    freq1 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version1 as func; print func.__doc__')
    freq2 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version2 as func; print func.__doc__')
    freq3 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version3 as func; print func.__doc__')
    freq4 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version4 as func; print func.__doc__')
    freq5 = timeit.Timer('func(words,chunks=chunks)', 'from __main__ import words, word_frequency_version5 as func; print func.__doc__; chunks=10')
    freq6 = timeit.Timer('func(words)', 'from __main__ import words, word_frequency_version6 as func; print func.__doc__')

Results:

    >>> print "n={n}, m={m}".format(n=len(words), m=len(set(words)))
    n=692766, m=34464
    >>> freq1.timeit(10)
    "Petar initial"
    3.914874792098999
    >>> freq2.timeit(10)
    "Matt initial"
    3.8329160213470459
    >>> freq3.timeit(10)
    "Noting max as we go"
    4.1247420310974121
    >>> freq4.timeit(10)
    "Built-in Counter"
    6.1084718704223633
    >>> freq5.timeit(10)
    "multiprocessing Matt initial suggestion"
    9.7867341041564941

Notes:

  • I cheat by passing an instance of multiprocessing.Pool as a kwarg, for timing purposes, since I wanted to avoid the pool start-up cost and timeit does not let you specify clean-up code. This was run on a "quad" CPU; I am sure that for some combinations of input size and CPU count, multiprocessing will be faster.
  • For the most part these versions return only a single highest-frequency word, which can be arbitrary if there is a tie for first place.
  • Approximating the most frequent word may be faster (e.g. using sampling), but the result will only be approximate.
  • Version 6 (the one-liner) should be avoided for large n*m values, since words.count rescans the whole list for every distinct word, making it O(n*m).
+1

You have to go through all the words at least once, which gives Omega(n). Storing the values you have so far for each distinct word gives Omega(log n) per access.

If you can find a store (get/set) that is Omega(1) for the distinct words, you can build a solution that is Omega(n). As far as I know, we only have Omega(log n) worst-case guarantees for such storage (regardless of type: heap, map, tree, dict, set, ...).

EDIT (see comments): [Your solution is O(n log n) because of the dictionary lookups] + O(n) because of max(), which makes it O(n log n) overall... and that is fine.

As far as I know, this is (complexity-wise) a good solution. You might improve the constant factors by using different kinds of storage, such as specialized trees or heaps, but the complexity should stay the same.

EDIT: From the discussion in the comments: with a hashtable you can get Omega(n) in the average and amortized case.

+1

Your dict/Counter solution looks fine to me. Its advantage is that the counting step can be parallelized.

Another obvious algorithm:

  • Sort the list
  • Walk through the sorted list, counting runs of duplicate values and recording the longest run so far

This has a time complexity of O(n log n), where n is the length of the list.
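A minimal sketch of that approach (the function name is mine; it assumes the list items are sortable):

    def most_common_sorted(words):
        # the sort is the O(n log n) part; the scan below is O(n)
        best_word, best_run = None, 0
        current_word, current_run = None, 0
        for w in sorted(words):
            if w == current_word:
                current_run += 1
            else:
                current_word, current_run = w, 1
            if current_run > best_run:
                best_word, best_run = w, current_run
        return best_word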

0

Obviously you need to look at every word in words, so it can only be the search at the end that is the problem. Would keeping an extra reference to the most common word be an option? Something like:

    def stupid(words):
        freqs = {}
        most = None
        for w in words:
            word_freq = freqs.get(w, 0) + 1
            if most is None or word_freq > most[0]:
                most = (word_freq, w)
            freqs[w] = word_freq
        return most if most is None else most[1]

This will, of course, use extra space, but it avoids the final search.

-1
