Grouping Rows by Similarity

I have an array of strings, not so many (maybe a few hundred), but often long (a few hundred characters).

These lines are usually nonsense and different from each other. But in the group of these lines, maybe 5 out of 300, there is a great similarity. In fact, this is the same line that distinguishes formatting, punctuation and a few words.

How can I handle this group of strings?

By the way, I write in ruby, but if nothing else, the algorithm in the pseudo-code will be fine.

thanks

+6
string algorithm ruby grouping similarity
source share
4 answers

There are many ways to compare strings for similarities.

Here is a site with various similarity indicators that you can use.

Here is a Wikipedia article with various similarity indicators that you can use.

+4
source share
+2
source share

Assuming you are not worried about mistakes or other mistakes in each word, you can do the following:

Build an inverted index, which is basically a hash entered by word, pointing to a list of pointers to lines that contain that word (how you handle duplicate entries is up to you). To determine strings that are similar to a given query string, find each query word in the index and for each source row in the result lists, count how many times the source string is displayed in each list. Lines with the highest number of samples are your best candidates for similarity, because they contain most common words.

Then you can calculate the editing distance between the two lines or any other metric you want. This way you avoid the complexity of O (n ^ 2) for comparing each row with all the other rows.

+1
source share

This may be excessive and perhaps not quite suitable for what you want to achieve, but you can use "Ferret" to help (Rubene version of Lucene is a full-text index / search API) for sorting from punctuation and formatting - also if sentences differ in the usual "stop words" (and ,, ...), they can be filtered out.

Then your queries will be assigned weights: this gives an idea of ​​the similarities.

http://www.davebalmain.com/ http://www.amazon.co.uk/Ferret-David-Balmain/dp/0596519400/ref=sr_1_2?ie=UTF8&s=books&qid=1264751909&sr=8-2

0
source share

All Articles