Earlier I asked a similar question on this topic, I got several solutions that worked, one based on flowering filters + ngrams, the other based on hash tables + ngrams. Both solutions do very well with small data sets (<1000 texts, usually tweets), but the computational time exponentially means that 10,000 can take hours.
I am currently working in Ruby and maybe this is a problem, but are there any other solutions or approaches that I could try to solve this problem?
source share