An effective algorithm for finding the most common phrases in a large volume of text

I am thinking of writing a program to find the most common phrases in a large volume of text. If the problem were reduced to a simple word count, it would be as easy as storing each new word in a hash map and incrementing its counter on every occurrence. But with phrases, storing every possible sub-phrase of a sentence as a key seems impractical.
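For comparison, the single-word case really is that simple. A minimal Python sketch, using the standard library's `Counter` (which is exactly a hash map with per-key counters):

```python
from collections import Counter

# Hash map keyed by word; each occurrence increments that word's counter.
text = "the cat sat on the mat and the cat slept"
counts = Counter(text.split())
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```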

Basically the problem comes down to figuring out how to extract every possible phrase from a sufficiently large text. Counting the phrases and then sorting by number of occurrences becomes trivial.

algorithm data-structures frequency word-frequency
1 answer

I assume that you are looking for common patterns of consecutive words in the same order (for example, “top of the world” would not be counted as the same phrase as “the world of top”).

If so, I would recommend the following linear time approach:

  • Split the text into words and strip out anything you do not consider significant (for example, lower-case everything and remove punctuation, word breaks, etc.).
  • Convert the text to an array of integers, one integer per unique word (for example, each instance of "cat" becomes 1, each "dog" becomes 2). This can be done in linear time using a hash-based dictionary that stores the word-to-number mapping; if a word is not yet in the dictionary, assign it a new identifier.
  • Build a suffix array over the integer array (this is a sorted list of all suffixes of your array and can be constructed in linear time, for example using the algorithm and C code here).
  • Build the longest-common-prefix (LCP) array for your suffix array (this can also be done in linear time, for example using this C code). The LCP array gives, for each consecutive pair of suffixes in the suffix array, the number of words they share at the beginning.
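The steps above can be sketched in Python as follows. This is illustrative only: `build_tables` is a made-up name, and the suffix array is built with a naive O(n² log n) sort standing in for the linear-time constructions linked above:

```python
def build_tables(text):
    # Steps 1-2: tokenize and map each unique word to an integer id.
    words = text.lower().split()
    ids = {}
    arr = [ids.setdefault(w, len(ids)) for w in words]

    # Step 3: suffix array. Naive sort for clarity; a real implementation
    # would use a linear-time construction such as skew (DC3) or SA-IS.
    sa = sorted(range(len(arr)), key=lambda i: arr[i:])

    # Step 4: lcp[k] = number of leading words shared by the suffixes
    # at sa[k-1] and sa[k].
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(arr) and b + n < len(arr) and arr[a + n] == arr[b + n]:
            n += 1
        lcp[k] = n
    return arr, sa, lcp
```

For instance, `build_tables("the cat sat the cat sat")` maps the words to `[0, 1, 2, 0, 1, 2]`, and the LCP entry between the two suffixes starting with "the cat sat" is 3.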

Now you can collect your common phrases.

It's not entirely clear how you want to define the end of a phrase. One possibility is simply to collect all sequences of four words that are repeated. This can be done in linear time by walking through your suffix array and looking at places where the LCP array is >= 4. Each run of indices x in the range [start + 1 ... start + len] where LCP[x] >= 4 corresponds to a phrase that occurs len + 1 times. The phrase itself is given by the first four words of, for example, the suffix at start + 1.
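This scan can be sketched as follows (illustrative Python; `repeated_phrases` is a made-up name, and the suffix and LCP arrays are built naively here rather than in linear time):

```python
def repeated_phrases(words, k=4):
    # Build the suffix array and LCP array naively for the demo
    # (word lists are compared directly instead of integer ids).
    sa = sorted(range(len(words)), key=lambda i: words[i:])
    lcp = [0] * len(words)
    for x in range(1, len(words)):
        a, b = words[sa[x - 1]:], words[sa[x]:]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[x] = n

    # Walk the suffix array: each maximal run of LCP values >= k marks
    # one k-word phrase shared by (run length + 1) suffixes.
    found = {}
    x = 1
    while x < len(words):
        if lcp[x] >= k:
            start = x - 1
            while x < len(words) and lcp[x] >= k:
                x += 1
            phrase = " ".join(words[sa[start]:sa[start] + k])
            found[phrase] = x - start  # number of occurrences
        else:
            x += 1
    return found

words = "to be or not to be that is the question to be or not to be".split()
print(repeated_phrases(words))
```

On this input the repeated six-word stretch "to be or not to be" yields three overlapping four-word phrases, each reported with a count of 2.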

Note that this approach will potentially include phrases that cross sentence boundaries. You may prefer to convert certain punctuation marks, such as full stops, to unique integers to prevent this.
