Effective methods for finding the most common phrases in the text.

Earlier I asked a similar question on this topic, I got several solutions that worked, one based on flowering filters + ngrams, the other based on hash tables + ngrams. Both solutions do very well with small data sets (<1000 texts, usually tweets), but the computational time exponentially means that 10,000 can take hours.

I am currently working in Ruby and maybe this is a problem, but are there any other solutions or approaches that I could try to solve this problem?

+4
source share
2 answers

If you are looking for text search in large datasets, you might have to think about something like solr. There is a very simple solr gem setup called sunspot http://outoftime.github.com/sunspot/

+1
source

Your problem can be solved by following these steps:

  • (Optional, for performance purposes) Skip all documents, create a mapping between a unique word and an integer. In addition, it is better to create a special mapping to complete the sentence (.!? Etc.). This makes it easier to check phrases that do not cross the boundary of a sentence.
  • Combine all documents into a huge array of integers (in the previous step). This can be done online (to save space) when we move on to the next steps.
  • Building an array of string suffixes in the previous step, complemented by the longest common prefix array . The fastest implementation is SA-IS, which runs in O (n) in the worst case. See here . Some special processing is required to ensure that each common prefix does not cross the boundary of a sentence.
  • The LCP array is basically the result you need. You can do whatever you want with it, for example: sort it to find the longest repeating phrases among documents, find all 5 words, 4 words, three-word phrases, etc. The most common phrases (I guess at least 2 -word) can be found by looking at both the LCP and the suffix array.

A quick Google search reveals that this library contains an implementation of the Ruby suffix array. You can create an LCP array from there in the O (n) Reference .

0
source

All Articles