I am launching a photo website where users can enter any tag they like, including tags that have not been used before. As a result, one user may tag a photo “insect” while another tags a similar photo “insects.”
I would like to keep free tagging, but I also want a way to catch such near-duplicates. The collection currently holds about 1,500 tags. My idea is to read all of them from the database into memory and then run an algorithm over them that lists the “suspects.”
My definition of a suspect is a pair of tags in which x% of the characters are the same (same character, same order), where x is configurable. I could probably code a really inefficient way to do this, but I was wondering whether there is an existing solution to this problem.
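To make the idea concrete, here is a rough sketch of the brute-force version I have in mind. It assumes Python and uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for the “x% of the characters match in order” rule; the function name `find_suspects`, the 0.8 threshold, and the sample tags are placeholders for illustration, not anything from my actual site.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_suspects(tags, threshold=0.8):
    """Return (tag_a, tag_b, score) for every pair scoring at or above threshold.

    threshold plays the role of the configurable x (0.8 means 80%).
    """
    suspects = []
    # Compare every unordered pair of distinct tags once.
    for a, b in combinations(sorted(set(tags)), 2):
        # ratio() is 2 * matches / (len(a) + len(b)), a value in [0, 1].
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            suspects.append((a, b, score))
    return suspects


# Hypothetical sample data; the real tags would come from the database.
tags = ["insect", "insects", "flower", "flowers", "sunset"]
for a, b, score in find_suspects(tags):
    print(a, b, round(score, 2))
```

This compares every pair, so 1,500 tags means roughly 1.1 million comparisons per run, which is the inefficiency I would like to avoid if a better approach exists.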
Edit: I forgot to mention that simply sorting the tags is not enough, because I would still have to go through the entire set by hand to find the duplicates.
Ferdy