You can use Apache Lucene, but depending on the use case, this may be too heavy. For very simple fuzzy searches, it can be a little difficult to use and (fix me if I'm wrong), this requires you to build an index.
If you need a simple online algorithm (= not supporting an index), you can use the fuzzy Bitap algorithm . I found an implementation in Java here . This code is suitable for one relatively short method with an almost self-evident signature:
public static List<Integer> find(String doc, String pattern, int k)
Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy string matching. It can be considered as a fuzzy version of String.equals , the bit file is similar to a fuzzy version of String.indexOf and still uses the Levenshtein distance measure. Usually more efficient than naive, using Levenshtein to compare a search pattern with each substring that might match.
Notes :
- The Bitap algorithm seems to be mostly useful for relatively small alphabets, for example. simple ASCII. In fact, the version of Simon Watiau that I am associated with throws an
ArrayIndexOutOfBoundsException on non-ASCII characters (> = 128), so you have to filter them out. I tried using Bimap in the application to search the list in the list of people by name. I found that Levenshtein 2 distance gives too many false positives. Levenshtein’s distance from 1 works is better, but he cannot detect a typo where you change two letters, for example. "William" and "William." I can come up with several ways to solve this problem, for example
- perform a fuzzy search only if the exact search does not find matches (and displays a message to the user about this)
- configure Bitap to use Damerau-Levenshtein distance, where the swap has a distance of 1 instead of 2. According to wikipedia , this is possible, but I could not find an existing implementation in Java.
- instead of "contains" do "startsWith". fuzzy search tools contain a prefix version of Damerau-Levenshtein, but it gave me an
ArrayIndexOutOfBoundsException - tune the algorithm to enter search results ranking where exact matches are ranked higher
If you're going to do 2 or 4, it might be better to use the right full-text search library like Lucene anyway.
- More information on fuzzy searches can be found on this blog . This author also created an implementation in Java called
BitapOnlineSearcher , but you need to use java.io.Reader along with the alphabet class. This Javadoc is written in Russian.
Henno Vermeulen Oct. 14 '15 at 12:43 2015-10-14 12:43
source share