Effective methods for finding the most common phrases in the text.

Question

Effective methods for finding the most common phrases in the text.

Earlier I asked a similar question on this topic, I got several solutions that worked, one based on flowering filters + ngrams, the other based on hash tables + ngrams. Both solutions do very well with small data sets (<1000 texts, usually tweets), but the computational time exponentially means that 10,000 can take hours.

I am currently working in Ruby and maybe this is a problem, but are there any other solutions or approaches that I could try to solve this problem?

+4

algorithm ruby data-structures data-mining

benofsky Jul 27 '10 at 19:35

source share

2 answers

smnirven · Answer 1 · 2010-07-27T20:20:20+0000

If you are looking for text search in large datasets, you might have to think about something like solr. There is a very simple solr gem setup called sunspot http://outoftime.github.com/sunspot/

Thanh dk · Answer 2 · 2014-03-17T08:03:31+0000

Your problem can be solved by following these steps:

(Optional, for performance purposes) Skip all documents, create a mapping between a unique word and an integer. In addition, it is better to create a special mapping to complete the sentence (.!? Etc.). This makes it easier to check phrases that do not cross the boundary of a sentence.
Combine all documents into a huge array of integers (in the previous step). This can be done online (to save space) when we move on to the next steps.
Building an array of string suffixes in the previous step, complemented by the longest common prefix array . The fastest implementation is SA-IS, which runs in O (n) in the worst case. See here . Some special processing is required to ensure that each common prefix does not cross the boundary of a sentence.
The LCP array is basically the result you need. You can do whatever you want with it, for example: sort it to find the longest repeating phrases among documents, find all 5 words, 4 words, three-word phrases, etc. The most common phrases (I guess at least 2 -word) can be found by looking at both the LCP and the suffix array.

A quick Google search reveals that this library contains an implementation of the Ruby suffix array. You can create an LCP array from there in the O (n) Reference .

Effective methods for finding the most common phrases in the text.

More articles: