Detect duplicate / similar text among large datasets?

I have a large database with thousands of records. Every time a user submits a post, I need to know whether the same or a similar post already exists. Are there any open-source algorithms or implementations that solve this problem?

The text is Chinese, and by "similar" I mean the posts share most of their content, roughly 80%-100% identical. Each record is not very big, about 2-6 KB.

Tags: algorithm, similarity

The approach I am going to suggest is in a class with very high complexity (worst case it is quintic; in the expected case it is quartic to check your database the first time, then quartic/cubic to add a single record), so it does not scale well. Unfortunately, there is not a much better answer I can think of right now.

The algorithm is the Ratcliff/Obershelp algorithm, and it is implemented in Python's difflib. The algorithm itself is cubic time in the worst case and quadratic in the expected case. You then have to run it for every possible pair of records, which is quadratic in the number of records; adding a single record afterwards is, of course, only a linear number of comparisons.

EDIT: Sorry, I misread the documentation: difflib is quadratic in the worst case, not cubic. Do use it rather than another algorithm.
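A minimal sketch of the pairwise check with difflib (the helper name `is_duplicate` and the 0.8 threshold are my own choices, matching the 80% figure in the question):

```python
import difflib

def is_duplicate(new_post, existing_posts, threshold=0.8):
    """Return True if new_post is at least `threshold` similar to any
    existing post, using the Ratcliff/Obershelp similarity ratio."""
    for post in existing_posts:
        matcher = difflib.SequenceMatcher(None, new_post, post)
        # quick_ratio() is a cheap upper bound on ratio(); skip the
        # expensive ratio() call when it already falls below the threshold
        if matcher.quick_ratio() >= threshold and matcher.ratio() >= threshold:
            return True
    return False

posts = ["今天天气很好，我们一起去公园散步吧。", "这是一条完全不同的记录。"]
print(is_duplicate("今天天气很好，我们一起去公园散步。", posts))  # True
```

Checking one new record against the whole database is one linear pass like this; the quadratic blow-up only happens when you verify the entire existing database pair by pair the first time.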


Take a look at shingle/min-hash methods. Here is a presentation that may help you.
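A rough sketch of the idea, assuming character-level shingles (which conveniently sidesteps Chinese word segmentation) and MD5-derived hash functions; the function names and the choice of 64 hashes are illustrative, not from a particular library:

```python
import hashlib

def shingles(text, k=3):
    # Character-level k-shingles; works for Chinese without tokenization.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text, num_hashes=64, k=3):
    # For each seeded hash function, keep the minimum hash value
    # observed over all shingles of the text.
    grams = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # The fraction of agreeing signature positions estimates the
    # Jaccard similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature("今天天气很好，我们一起去公园散步吧。")
sig2 = minhash_signature("今天天气很好，我们一起去公园散步。")
print(estimated_jaccard(sig1, sig2))  # near 1.0 for near-duplicate posts
```

The attraction over pairwise diffing is that each record is reduced to a small fixed-size signature once, so comparing a new post against stored signatures is cheap, and locality-sensitive hashing on the signatures can avoid even that linear scan.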


One approach I have used for something like this is to build a conventional search index based on word statistics, and then run each new item against that index as if it were a search query: if the score of the top hit is too high, the new item is too similar. No doubt some of the standard text-search libraries can be used for this, although with only a few thousand records it is fairly trivial to build your own.
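A sketch of the hand-rolled version, assuming character bigrams as index terms (tokenizing Chinese is non-trivial) and cosine similarity over term counts; `SimilarityIndex` and the 0.8 threshold are illustrative choices:

```python
import math
from collections import Counter, defaultdict

def bigrams(text):
    # Character bigrams stand in for word tokenization.
    return [text[i:i + 2] for i in range(len(text) - 1)]

class SimilarityIndex:
    def __init__(self):
        self.docs = []                     # term counts per stored record
        self.postings = defaultdict(set)   # term -> ids of docs containing it

    def add(self, text):
        counts = Counter(bigrams(text))
        doc_id = len(self.docs)
        self.docs.append(counts)
        for term in counts:
            self.postings[term].add(doc_id)

    def top_score(self, text):
        # Score only documents sharing at least one term with the query,
        # using plain cosine similarity over raw term counts.
        query = Counter(bigrams(text))
        q_norm = math.sqrt(sum(v * v for v in query.values()))
        candidates = set()
        for term in query:
            candidates |= self.postings.get(term, set())
        best = 0.0
        for doc_id in candidates:
            doc = self.docs[doc_id]
            dot = sum(query[t] * doc[t] for t in query if t in doc)
            d_norm = math.sqrt(sum(v * v for v in doc.values()))
            best = max(best, dot / (q_norm * d_norm))
        return best

index = SimilarityIndex()
index.add("今天天气很好，我们一起去公园散步吧。")
if index.top_score("今天天气很好，我们一起去公园散步。") > 0.8:
    print("too similar - reject or flag")   # threshold from the question's 80%
```

The inverted index means each new post is only scored against records that share at least one term with it, rather than against the entire database.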

