Need an algorithm to search for almost duplicated text values

I am launching a website with photos where users can enter any tag that they like, even tags that have not been used before. As a result, a tag’s photo can sometimes be labeled as “insect,” while someone else marks it as “insects.”

I would like to keep the possibility of free tagging, but I would like to have a way to filter out such almost duplicates. The total collection of tags is currently 1,500. My idea is to read all of them from the database into memory, and then run an algorithm on it that displays “suspects”.

My idea of ​​the suspect is that x% of the characters in the string are the same (same char and order) where x is being configured. I could probably code a really inefficient way to do this, but I was wondering if there is an existing solution to this problem?

Edit: Forgot to mention: just sorting the tags is not enough, as this will require me to go through the entire set to find the cheats.

+4
source share
4 answers

There are some flaws in your logic. For example, what happens when the plural of an object is different from the singular (for example, a person or people, or even candy versus candy).

If English is the primary language, check out Soundex , which allows you to perform phonetic matches. Also consider using a crowd-built synonym model where users can create links to existing tags.

+2
source

Perhaps the algorithm you are looking for is an approximate string match. http://en.wikipedia.org/wiki/Approximate_string_matching .

With this word, you can match it with a list of words, and if the "distance" is close, add it to the suspects.

A quick implementation is to use dynamic programming such as the Needleman-Wunsch algorithm. I made an example blog in C # where you can adjust the "distance" using the matrix character search file. http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/

+2
source

Is "either" good? You can make an SQL query something like this if your images are in a database (which would make sense):

SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1; 
0
source

If you really want to do this efficiently, I would suggest some kind of JavaScript implementation that displays the possibilities when the user enters the tag that they want. It will not only save the user’s time, so he will be happy to see 5 sentences as they are entered. It will automatically stop them from entering “suspects” when the “suspect” appears as an offer. This, of course, if they really do not want the “suspects” to be urgent.

You can download a huge list of words, and custom types narrow them down. I feel this can be very simplified esp if you want to anticipate correctly spelled words. If someone misses the letter, they will probably come back to fix it after seeing a list of offers that are not at all what they wanted to print. And when they type the word correctly, it will appear in sentences.

0
source

All Articles